# Full text of "Business Statistics"

BUSINESS STATISTICS

by MARTIN A. BRUMBAUGH, Ph.D., Professor of Statistics, University of Buffalo

and LESTER S. KELLOGG, M.A., Assistant Professor of Statistical Research, Ohio State University

with the collaboration of IRENE J. GRAHAM, M.A., Laboratory Assistant, Department of Statistics, University of Buffalo

1950

RICHARD D. IRWIN, INC., CHICAGO

COPYRIGHT 1941 BY RICHARD D. IRWIN, INC. ALL RIGHTS RESERVED. THIS BOOK OR ANY PART THEREOF MAY NOT BE REPRODUCED WITHOUT THE PERMISSION OF THE PUBLISHER.

FIRST EDITION

First Printing: September, 1941
Second Printing: August, 1942
Third Printing: February, 1946
Fourth Printing: August, 1946
Fifth Printing: November, 1946
Sixth Printing: May, 1947
Seventh Printing: October, 1947
Eighth Printing: February, 1949
Ninth Printing: September, 1949
Tenth Printing: October, [year illegible]

PRINTED IN THE UNITED STATES OF AMERICA

PREFACE

AS STATISTICAL METHODS have been gradually expanded in recent years, textbook writers on the subject have exhibited a noticeable tendency to increase the amount of advanced material at the expense of the elementary content. The authors of this book hold to the opinion that the first requisite of an adequate structure is a sound foundation. Pursuant to this point of view they have attempted to place unusual emphasis upon the elementary or foundation methods of the subject. No attempt has been made to attain the research frontiers of any phase of statistical analysis. In short the aim is to present statistical materials and methods that are in everyday use in the conduct of business affairs.

The readers of statistical texts might be divided into two broad groups: those who will compile, analyze, and interpret statistical data and those who will be the users of the results prepared by the first group. The latter readers comprise much the larger group and for their use either as students or business practitioners this book provides the essentials of method and the mental conditioning so necessary for effective "statistical consumption." The training requirements of the first group, the statistical producers, differ somewhat according to the level at which they engage in statistical work. Those few who are conducting advanced research work have little need for this book. The larger number who either now or in the future contemplate engaging in the usual type of statistical collection, analysis, and presentation carried on within business concerns, statistical organizations, and governmental agencies will find that the contents of this book provide sound guidance.

The student will quickly discover that the methods of statistics form a related whole; that the division into chapters is for convenience in the classroom rather than for the separation of disconnected subject matter. The structure is built literally method upon method like the bricks in a wall and, to carry the analogy a step farther, the binding mortar is reasoning rather than memory. The student who attempts to acquire statistical knowledge solely by memorizing rules and formulas invariably fails to develop the power to apply his skill to practical problems. On the other hand the student who approaches the subject with an eternal "Why?" and insists on having his curiosity satisfied has a much better chance of developing the power needed to solve new problems as they arise.
This distinction of attitude is exemplified by a criticism often leveled at the writers of statistical texts: "If they would only tell us what techniques to use in analyzing different kinds of data they could save so much time and take most of the mystery out of the subject." If the quoted suggestion could be put into effect, statistical practice would in truth be reduced to a routine level. But the case is not so simple because the type of analysis that should be applied to a particular set of data depends entirely on the purpose of the analysis, the specific use that will be made of the results, the time and funds that are available, and other related considerations. Therefore neither in this book nor in any other will rules be found relating methods of analysis to data on specific subjects. The function of the statistician must always be the exercise of judgment as to the proper methods to employ in the investigation of a given problem. From the standpoint of mathematics, this book assumes only that the reader has proficiency in arithmetic and enough knowledge of algebraic symbolism to be able to substitute values in a formula. Even this modest assumption is partially reinforced by the introduction early in the book of a chapter entitled "The Use of Numbers," the content of which is partially a review of arithmetic. The explanations in the book assume that the reader possesses some familiarity with economics and understands in an elementary way the organization and functioning of individual business concerns. A knowledge of accounting principles, marketing principles, and recent business history including the relation of government to business provides a desirable although less essential background. The subject matter of this book can be covered in a ninety-hour course. By reducing the intensity of coverage somewhat, the entire content can be included in a sixty-hour course. 
For briefer courses some chapters will undoubtedly have to be omitted entirely or in part. It is difficult to specify which chapters might be omitted, since each case requires a knowledge of the point of view of the instructor, the capabilities of the particular group of students, and the purpose of the statistics course in the curriculum.

The problems appearing at the end of each chapter have been so planned that the student who prepares answers to all or a large part of them will be forced to apply the important principles of the subject to situations akin to those found in statistical practice. The authors hold steadfastly to the opinion that effective teaching of statistics requires the liberal use of problems to insure facility in computation, to provide constant practice in selecting and adapting methods to particular situations, to develop proficiency in the interpretation of the results of investigation, and to encourage accurate reporting of completed analyses. Many of the problems have been reduced in size to accommodate them to the needs of students, but an attempt has been made to avoid simplification to the point of absurdity.

A list of references is appended to each chapter. These lists are intended to be selective rather than comprehensive. The publications named are included because they contain material which supplements the development in the text or because they offer the opportunity for either more intensive or more advanced study of the subject matter in the text. The reference lists are not intended to set forth the sources from which assistance was drawn for this book. Wherever such assistance is specific, direct footnote reference has been made to the source and the author's permission for use has been secured. For aid in the broader sense the authors are permanently indebted to writers in the field of statistics in so many directions that anything beyond general acknowledgment of the obligation would be impossible.
If, by some mischance, the authors have failed to acknowledge materials reproduced from other writers, such omission is wholly unintentional.

It is impossible to make a complete acknowledgment to Miss Irene Graham, who has collaborated with the authors in the preparation of this text. Her most outstanding work was the writing of chapters XIII and XIV and major revision of chapter XI. She has also contributed materially to the text of chapters IV to X, XV, XVII, XIX, and XXX, in addition to research, criticism, reorganization and editing of all chapters.

Dr. Robert Riegel, professor of statistics, University of Buffalo, has read and criticized the manuscript in various stages of preparation. His suggestions concerning statistical soundness and pedagogical desirability represent a contribution that the authors acknowledge gratefully.

Chapter XXVI has been made possible through the co-operation of Mr. A. H. Robinson, assistant-treasurer, and Mr. Lawrence M. Tarnow, head of the planning department of Eastman Kodak Company of Rochester, New York. To them for their assistance, and to the Eastman Kodak Company for permission to use the material, the authors express their sincere thanks.

Most of the graphs were prepared by Mr. Ralph Lownie, a student in the School of Business Administration, University of Buffalo. Their quality is due entirely to his craftsmanship. The remaining graphs were drawn by Mrs. Dorothy Tallman, who also shared with Mrs. Ruth Carroll the extremely laborious task of typing the manuscript.

M. A. BRUMBAUGH
LESTER S. KELLOGG
September, 1941

TABLE OF CONTENTS

I. STATISTICS IN BUSINESS. The Statistical Approach. The Work of the Statistician.

II. THE USE OF NUMBERS. Introduction. The Fundamental Operations. Fractions. Square Roots. Accuracy of Statistical Data. Summary.

III. STATISTICAL INVESTIGATION. The Character of Statistical Investigation. The Canons of Statistical Investigation. Steps in Statistical Investigation.
The Scope of Different Investigations. Summary.

IV. PRELIMINARY PLANNING OF INVESTIGATIONS. Introduction. Define the Problem. Study the Problem. Plan the Procedure. Prepare a Statement of the Program.

V. SAMPLING. Relation to Knowledge. The Importance of Sampling. The Principle of Statistical Regularity. The Two Problems of Sampling.

VI. COLLECTION OF DATA: DIRECT SOURCES. Description of Direct Sources. Collecting Data From Direct Sources. Summary.

VII. EDITING AND PRELIMINARY TABULATION. Editing Schedules. Preliminary Tabulation.

VIII. TABULATION. Definitions. Types of Tables. Established Practice in the Construction of Tables. Tabular Forms.

IX. CLASSIFICATION OF LIBRARY SOURCES. The Meaning of Collection From Library Sources. Methods of Classifying Sources. Appendix A: Selected Sources Listed According to Publishing Agency, Title, Frequency of Publication, and Contents.

X. THE USE OF LIBRARY SOURCES. Introduction. Finding a Good Source. The Correct Use of Data. Appendix B: Examples of Search for Data in Libraries.

XI. RATIOS. The Importance of Ratios in Statistics. Construction of Statistical Ratios. Presentation of Ratios. Comparisons Between Ratios.

XII. APPLICATIONS OF RATIOS. Refined Ratios. Compound Ratios. Examples of the Use of Ratios in Business.

XIII. GRAPHS. Introduction. Simple Types of Graphs: Methods and Purposes of Each. Introduction to Two-Dimensional Linear Graphs.

XIV. GRAPHS (continued). Time Series Graphs. Planning Graphs for General Effect.

XV. FREQUENCY DISTRIBUTIONS AND GRAPHS. Frequency Distributions. Graphs of Frequency Distributions.

XVI. MEASURES OF CENTRAL TENDENCY: AVERAGES OF CALCULATION. Introduction. The Arithmetic Average. The Geometric Average.

XVII. MEASURES OF CENTRAL TENDENCY: AVERAGES OF POSITION. The Median. The Mode. Criteria for Selecting and Judging Averages.

XVIII. MEASURES OF DISPERSION AND SKEWNESS. Introduction.
Dispersion. Skewness. Uses of Measures of Dispersion and Skewness.

XIX. INDEX NUMBERS. Introduction. Kinds of Index Numbers. Basic Methods of Constructing Index Numbers. Symbols. Unweighted Formulas. Weighted Formulas. Problems of Index Number Construction. Tests of Index Numbers. Specific Uses of Index Numbers and Their Interpretation.

XX. SOME COMMONLY USED INDEXES. Introduction. An Index of Cost of Living. An Index of Industrial Production. An Index of Employment. A Wholesale Price Index. A Local Business Index.

XXI. ANALYSIS OF TIME SERIES. A Change in Emphasis. Components of Time Series. The Problem of Time Series Analysis. Preliminary Analysis.

XXII. TREND. The Location of Trend. Methods of Measuring Trend. Why Trend is Measured. Summary.

XXIII. SEASONAL AND CYCLICAL MOVEMENTS. Introduction. The Nature of Seasonal Variation. The Concept of a Seasonal Pattern. Methods of Measuring Seasonal Variation. The Cyclical Remainders. Special Seasonal Problems.

XXIV. SUMMARY OF THE ANALYSIS OF TIME SERIES (An Example). Adjustment of Calendar Variation. Adjustment of Changes in the Price Level. Adjustment of Seasonal Variation. Adjustment of Trend: Moving-Trend Method. The Cyclical Fluctuations.

XXV. INDEXES OF BUSINESS CONDITIONS. Need for External Information. Two Types of Business Indicators. Construction of Composite Indexes. Examples of the Construction of Composite Indexes. Local Indexes. Use of Business Indexes.

XXVI. INTERNAL APPLICATION OF TIME-SERIES ANALYSIS: PRODUCTION PLANNING. Introduction. General Discussion of Production Planning. Two Examples of Planning. Summary.

XXVII. CORRELATION. Introduction. Scattergram. The Regression Line. The Standard Error of Estimate. The Coefficient of Correlation. Some Deferred Points. The Rank Difference Measure of Correlation. Correlation of Time Series. Recapitulation of Formulas.

XXVIII. THE NORMAL CURVE. Probability. Binomial Distribution.
The Normal Curve. Other Types of Distributions. Goodness of Fit: Chi-Square Test.

XXIX. PRINCIPLES OF SAMPLING AND TESTS OF SIGNIFICANCE. Introduction. The Basis of Sampling. Tests of Significance. Small Samples. Variance Analysis.

XXX. PRESENTATION OF THE RESULTS OF STATISTICAL INVESTIGATION. Introduction. The Writer-Reader Relation. Requirements of a Report. The Form of a Report.

APPENDIX A. SELECTED SOURCES LISTED ACCORDING TO PUBLISHING AGENCY, TITLE, FREQUENCY OF PUBLICATION, AND CONTENTS

APPENDIX B. EXAMPLES OF SEARCH FOR DATA IN LIBRARIES

APPENDIX C. LOGARITHMS OF NUMBERS. FIVE PLACE TABLE OF LOGARITHMS

APPENDIX D. TABLE OF SQUARES, SQUARE ROOTS AND RECIPROCALS

APPENDIX E. TABLE OF ORDINATES

APPENDIX F. TABLE OF AREAS

FIGURES

1. Two Examples of the Development by Successive Steps of the Solution for Extracting Square Root
2. Schedule Used in a Real Estate Survey
3. Proposed Revision of Real Estate Schedule, Figure 2
4. Questionnaire Used in Worsted Spinning Spindle Inventory
5. Radio Section of Questionnaire Used in Surveying the College Market
6. Letter with Appeal for Reply Based on Co-operation
7. Letter with Appeal for Reply Based on Interest
8. Letter with Appeal for Reply Based on Profit
9. Letter with Appeal for Reply Based on Obligation
10. Letter with Appeal for Reply Based on Position
11. Letter Based on Compulsion
12. Example of Instructions to Collecting Agents
13. Collection Card Used in Residential Vacancy Investigation in Buffalo, New York
14. Tally Sheet for Recording Residential Vacancy in Buffalo, New York
15. Proposed Work Sheet for Questionnaire Used in College Market Investigation, Figure 5
16. Punched Cards for Mechanical Tabulation
17. Schedule Used in Collecting Data for the President's Conference on Home Building and Home Ownership, 1930
18. Instructions for Coding Data on Home Building and Home Ownership in Buffalo, New York, 1930
19. Code Sheet Used in Transferring Information from Questionnaire, Figure 17
20. Reproduction of the Printed Record from the Tabulating Machine with Headings Added
21. Form for Two-Way Cross-Classification
22. Form for Three-Way Cross-Classification
23. Form for Four-Way Cross-Classification
24. Form for Five-Way Cross-Classification
25. Reproduction of Department of Agriculture Form C E. 1-128
26. Reproduction of Department of Agriculture Long-Term Blank
27. Reproduction of Department of Agriculture Form C E. 1-139
28. Eastman Kodak Co. Form: Comparison of Sales by Divisions
29. Eastman Kodak Co. Form: Lost Time Report
30. Instructions on the Reverse Side of Lost Time Report
31. Eastman Kodak Co. Form: Labor Turnover Report
32. Organization Chart of the Government of the United States
33. Classification of Comparisons Between Ratios of Like Items with Examples of Each
34. Classification of Comparisons Between Ratios of Unlike Items with Examples of Each
35. Types of Graphs
36. Dot Maps: Filling Stations in the United States, 1935
37. Cross-Hatched Ratio Map: Filling Stations per 10,000 Persons in the United States, 1935
38. Flow Map: United States Exports, 1931
39. Dial Chart: Index of Industrial Activity as of May 31, 1941
40. Pictogram: Number of Workers in Basic Fields of Employment, 1940
41. Types of Bar Graphs
42. Bar Graph of Time Series
43. Band Graphs of Time Series: Per Cents and Dollar Values
44. Line Graphs of Time Series
45. Construction of the Ratio Scale
46. Curves Showing Changing Relative Rates on a Ratio Scale
47. Types of Lines
48. Methods of Plotting Time Periods
49. Tally of Monthly Rents Paid by 155 Families in a Consumer Survey in Columbus, Ohio
50. Array of Rents Paid by 155 Families in Columbus, Ohio
51. Methods of Designating Class Limits
52. Two Types of Frequency Diagram of Rent Data
53. Frequency Diagrams of Discrete Data. Number of Dresses Sold in Junior Sizes
54. Ogives: Cumulative Frequency Diagram of Rent Data
55. Frequency Diagrams of Hourly Wage Rates Paid by Fifty-two Industrial Concerns
56. Per Cent Comparison of Two Distributions of Rent Data
57. Types of Curves
58. Lorenz Curves: Cumulative Per Cents of Stores and Sales, Independent Retail Grocery Stores in Buffalo, 1929 and 1935
59. Location of the Median in an Array
60. Location of the Median in a Frequency Distribution
61. Summary of Characteristics of Measures of Central Tendency
62. Guide to the Suitability of Measures of Central Tendency According to the Condition of the Data
63. Comparison of Columbus Rentals with Normal Distribution According to Measures of Dispersion
64. Fractions of the Area of the Normal Curve Measured by the Standard Deviation
65. Summary of Criteria of Measures of Dispersion
66. Sources of Commonly Used Index Numbers
67. Nomograph for Reading Per Cents of Increase or Decrease in Index Numbers
68. National Industrial Conference Board Index of Cost of Living, Monthly, 1923-40
69. Federal Reserve Board Index of Industrial Production, Monthly, 1919-40
70. Indexes of Employment and Pay Rolls, Monthly, 1919-40
71. Wholesale Price Indexes of National Fertilizer Association and United States Bureau of Labor Statistics; Annually, 1929-35; Monthly, 1936-39; Weekly, January 1940-April 1941
72. Index of Bank Debits in Canton, Ohio, Monthly, 1926-40
73. Production of Wheat, Passenger Automobiles and Anthracite Coal in the United States, 1900-1937, and Free Hand Trend for Each Series
74. Consumption of Raw Cotton in the United States 1913-37
75. Monthly Sales of F. W. Woolworth Company, 1930-37
76. Daily Net Currency Movement in New York City to or from the Federal Reserve Bank of New York, April to September, 1926
77. Monthly Totals and Daily Averages for 1936 for Three Sets of Data in Buffalo, New York: Sales of a Drug Store, Flour Milled and Bank Clearings
78. Raw Cotton Exported by the United States 1913-22
79. Investment in Inventory of Swift and Company Packers, 1913-36
80. Moving Averages Fitted to Controlled Data Containing Cycle and Straight Line Trend
81. Moving Average Fitted to Controlled Data Containing Cycle and Curved Line Trend
82. Number of Horsepower of Diesel Engines Installed Annually, 1918-37
83. Diagram Used to Write the Equation of a Straight Line
84. Parabola Trend Fitted to Postal Receipts at Buffalo, New York, 1920-33
85. Logarithmic Trend Fitted to Production of Wood Pulp, 1923-37
86. Straight Line Trend Fitted to the Number of Lines of Magazine Advertising, 1913-37
87. Relative Cycles of Magazine Advertising, 1913-37
88. Straight Line Trend Fitted to Electric Power Production, 1919-29
89. Daily Average Consumption of Small Cigarettes, Monthly, 1927-36
90. Approximate Method: Test for Seasonal Pattern of Relatives of Annual Averages
91. Moving Average Method: Test for Seasonal Pattern of Relatives of Moving Averages
92. Link-Relative Method: Test for Seasonal Pattern of Relatives of Preceding Month
93. Ratio-to-Trend Method: Test for Seasonal Pattern of Relatives of Trend
94. Seasonal Patterns of Cigarette Consumption According to Four Methods
95. Relative Cycles of Cigarette Consumption, Monthly, 1927-36
96. Successive Steps in the Analysis of a Time Series
97. Relative Cycles of Industrial Stock Prices and Commercial Paper Rates, Monthly Data, 1919-37
98. Annalist Index of Business Activity
99. Babson Chart of Business Conditions
100. Buffalo Index of Business Activity
101. Eastman Kodak Co. Planning Chart for Product "S"
102. Eastman Kodak Co. Planning Chart for Product "C"
103. Three Scattergram Patterns
104. Freehand Regression Line Showing Relation Between Prices and Earnings per Share of Common Stock of Twelve Chemical Manufacturers
105A-B-C. Regression Line Fitted by Least Squares Method to Prices and Earnings per Share of Common Stock of Twelve Chemical Manufacturers
106. Standard Error of Estimate and Standard Deviation of y, Prices in Relation to Earnings Per Share of Common Stock of Twelve Chemical Manufacturers
107. Rates Charged by Banks for Customer Loans in Eight Northern and Eastern Cities Exclusive of New York City, and the Yield of Aaa Bonds, with Straight Line Trend for Each Series, 1919-37
108. Curves of Three Binomial Expansions Compared with Normal Curve
109. Normal Curve Plotted by Calculating Values of Ordinates
110. Binomial and Normal Distributions Fitted to Frequency Distribution of Monthly Cost of Electric Current
111A. Binomial Frequency Distributions, $N(q+p)^{10}$, for Various Values of q and p when N = 100
111B. Binomial Frequency Distributions, $N(q+p)^{n}$, for Various Values of n when q = .9, p = .1, and N = 100
112. Diagram for Finding Values of P Associated with Computed Values of $\chi^2$ and $N - m$
113. Diagram for Finding Values of 2P Associated with Computed Values of $t$ and $N - m$
114. Diagram for Finding the Value of z Associated with 2P = .05 for a Given $N_1 - m_1$ and $N_2 - m_2$
115. Diagram for Finding the Value of z Associated with 2P = .01 for a Given $N_1 - m_1$ and $N_2 - m_2$

CHAPTER I

STATISTICS IN BUSINESS

THE STATISTICAL APPROACH

WHEN MASSES of numerical information are to be analyzed, some means of summarization must be found which will focus attention upon their major characteristics. Statistical methods have been developed to meet this need; hence in a broad sense the statistical approach is essentially a process of classification, subclassification, and cross-classification designed to give meaning to a mass of information by separating it into comparable parts. Statistical methods therefore are useful in any field of knowledge in which the recording of events produces masses of numerical information. The more important fields are psychology, sociology, education, medicine, biology, public affairs, economics, and business.

Statistical Data Distinguished from Abstract Numbers

Not all numbers are statistics. A table of logarithms is not a statistical table, but simply a compilation of abstract numbers obeying a fixed law. On the other hand statistical data are concrete numbers representing objects or measurements grouped according to stated characteristics. For example in Table 1 sales of low-priced automobiles are classified by make of car and by year of production. This double classification permits comparison of sales of the three makes in any year and the comparison of sales of each with the total. The changes in individual and total sales from year to year can also be read from the table.

TABLE 1

SALES OF PASSENGER AUTOMOBILES DURING THE MODEL YEAR 1937-39: THREE MAKES IN THE LOW-PRICED FIELD*

| Make of automobile | Model year 1937 | Model year 1938 | Model year 1939 |
|---|---:|---:|---:|
| Chevrolet | 804,350 | 465,403 | 577,986 |
| Ford | 807,258 | 345,244 | 456,792 |
| Plymouth | 500,503 | 268,436 | 387,452 |
| Total | 2,112,111 | 1,079,083 | 1,422,230 |

* Compiled from The Annalist, Vol. 49, No. 1258, p. 350; Vol. 51, No. 1310, p. 304; Vol. 53, No. 1362, p. 305; Vol. 55, No. 1418, p. 433.

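The comparisons that Table 1 permits can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the figures are transcribed from Table 1, while the names `SALES`, `TOTALS`, `share_of_total`, and `per_cent_change` are assumptions introduced here.

```python
# Sales by make and model year, transcribed from Table 1.
SALES = {
    "Chevrolet": {1937: 804_350, 1938: 465_403, 1939: 577_986},
    "Ford":      {1937: 807_258, 1938: 345_244, 1939: 456_792},
    "Plymouth":  {1937: 500_503, 1938: 268_436, 1939: 387_452},
}
YEARS = (1937, 1938, 1939)

# Column totals: combined sales of the three makes in each model year.
TOTALS = {year: sum(by_year[year] for by_year in SALES.values()) for year in YEARS}


def share_of_total(make, year):
    """Per cent of the three-make total accounted for by one make in one year."""
    return 100 * SALES[make][year] / TOTALS[year]


def per_cent_change(earlier, later):
    """Per cent change from an earlier figure to a later one."""
    return 100 * (later - earlier) / earlier


if __name__ == "__main__":
    # Comparison of each make with the total, year by year.
    for make in SALES:
        shares = "  ".join(f"{share_of_total(make, y):5.1f}%" for y in YEARS)
        print(f"{make:<10}{shares}")
    # Change in total sales from one model year to the next.
    for earlier, later in zip(YEARS, YEARS[1:]):
        change = per_cent_change(TOTALS[earlier], TOTALS[later])
        print(f"Total, {earlier} to {later}: {change:+.1f}%")
```

The computed column totals agree with the Total row of the table, which is a useful check on the transcription.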
Statistics deals with numbers not merely as such, but as the expression of a quantitative or qualitative relationship of the concepts with which they are associated. Statistical work is for the most part a matter of expressing these relationships in the best form, and of finding new relationships. Thus the comparisons observed in Table 1 might be facilitated by the computation of per cent distributions and index numbers. The development of such techniques of analysis forms a major part of the content of subsequent chapters.

Statistics in the Field of Business

While the statistical procedures useful in the several fields of knowledge are in the main identical, those procedures must be adapted to the particular types of information found in each field. The use of ratios is a basic method of analysis common in all types of statistical work but the emphasis on different kinds of comparison varies markedly from one field to another. In vital statistics the study of death rates leads to the development of crude rates, specific rates, standardized rates, and corrected rates. On the other hand business data require per cent relations, per cents of change, per capita ratios, per cent distributions, and index numbers. Whether used in vital statistics or business statistics the word "ratio" implies a relation between two items one of which is the numerator and the other the denominator. But the examples cited show the variation in usage and suggest the extent to which subject matter determines what type of ratio comparisons will be emphasized.

The relation of emphasis to subject matter can be illustrated further by considering time-series analysis. The business statistician spends a major fraction of his time in separating time series into their several components primarily to segregate the cyclical fluctuations.
In such fields as medicine, psychology, education, and biometry the techniques of time-series analysis are relatively unimportant and when used seldom have as a goal the study of cyclical fluctuations. In this illustration, as in the preceding one, subject matter and purpose determine to a large extent the form of use and the importance of a particular method of analysis.

These examples should be sufficient to indicate that a development of statistical methods applicable to a particular field of knowledge becomes more specific than a general presentation, and therefore is better suited to the needs of those interested in that field. Consequently this book is devoted in the main to a presentation of statistical methods and operating techniques suitable for the analysis of masses of numerical information arising in the field of business.

The word "business" is taken to include the aggregate of activities involved in transforming raw materials into finished consumable products and transferring goods at all stages of the process. The usual divisions of the field are production, marketing, financial operations, and transportation and communication. Such activities as legal advising, accounting, and technical research including statistical work have not been listed as divisions of business because they are adjunct to all phases of business operation. Statistics in particular may assist in the solution of problems arising in any part of the business field but has its greatest usefulness when large masses of numerical information are to be analyzed.

The following illustrations of the uses of statistical methods in the four main divisions of business enterprise should give sufficient evidence of the pervasiveness of the statistician's work.

Production
- Preparation of production schedules
- Determination of distribution of sizes in manufacture of hats, shoes, suits, dresses, etc.
- Analysis of time and motion studies
- Cost analyses

Marketing
- Determination of sales areas and sales quotas
- Study of effectiveness of advertising
- Relation of size of orders to net profits
- Relation of mark-downs of goods to buying policies

Financial operations
- Ratio analysis of financial statements by banks to determine the credit risk of prospective borrowers
- Determination of the average discount rate of customer loans of a bank

Transportation and communication
- Ratio analysis of railroad traffic to determine operating efficiency, operating density, etc.
- Study of relative costs of moving freight by truck and by rail
- Study of telephone and telegraph message density

THE WORK OF THE STATISTICIAN

Types of Statistical Work

The type of problem with which a statistician deals is determined by his location in the economic structure. If he is employed by a business concern his work consists mainly in the analysis of problems arising within the concern and his data are usually the records of the concern's operations. The extent and character of statistical work carried on within any individual concern depend upon the type of business and the funds available. There are many firms that do not maintain separate statistical departments, but which conduct statistical analysis as an adjunct to the main function of one or several departments. Considerable information concerning the variation in statistical practice in different concerns can be obtained from a survey made by the National Industrial Conference Board.

The Conference Board survey, which was begun in June, 1939, reveals that no uniform practice is followed in the organization of research. Only 30% of the companies maintain a separate centralized department for such work, or place the responsibility in the hands of a single statistician or economist. Ten per cent of the concerns assigned such research to a single executive.
The majority, about 60%, divided the task among several executive offices and departments. Somewhat greater centralization of research is found in financial and public service companies than in manufacturing concerns.

The greatest volume of work appears to be done in the accounting and controllers' departments, and the second heaviest volume falls on the executive offices. Next in importance come the sales division, the production department and, in fifth rank, the centralized statistical or economic research department.

Most companies compile data for internal use on sales and orders, production, employment and purchases. Less than one-fourth of the companies reporting, however, attempt to compile data on inventories in the hands of distributors of their products. Most companies also attempt to forecast future trends in sales, production, costs, inventory requirements, prices and profits. More than half of the organizations reported that they attempt to forecast sales by geographic regions.

Forty-two per cent of the companies carrying on research compile periodic reports on the outlook for the particular industry in which they are engaged, and nearly as many compile data on the prospects for business in general. About 15% also prepare studies on general business conditions as they affect purchasers of their finished products and suppliers of their raw materials. Other special studies carried on to a considerable extent by private industry, listed in the order of their importance, are the economic effects of taxation, analysis of departmental operations, personnel practices, plant layout, effects of legislation, and feasibility of plant expansion.¹

While this study shows the ramifications of statistical work in business it does not emphasize the variety of problems encountered by an individual statistician.
He may be asked to make cost analyses and forecasts of production for the manufacturing department, analyses of time and motion studies for the plant scheduling department, studies of employment, payroll, and wages for the personnel department, sales quotas for the sales department, estimates of plant burden for the maintenance department, studies of bad debt losses for the credit department, or an investigation of the relation between selling prices, sales volume, and turnover of inventory for the president. The variety of subjects included in this list demonstrates the scope of the work of the statistician employed by an individual concern. He must have considerable familiarity with the operations carried on in all departments of the concern and an understanding of the economic principles involved in those operations, in addition to a working knowledge of statistical methods.

Problems of a different type are dealt with by a statistician engaged in independent research or employed by a trade association, a commercial research agency, a government bureau, or a university research department. Much of the information used in this kind of statistical analysis is gathered from the records of individual concerns or agencies. The data are therefore of essentially the same nature as those used by each individual concern in analyzing its own problems, although the purpose of the analysis is different.

Table 2 is an illustration of a study that makes use of the records of a number of concerns. The Bureau of Business Research of the Harvard Graduate School of Business Administration maintains a regular reporting service through which it receives annual reports of operations from a large number of department stores in all sections of the country. This table gives a summary of the turnover rates computed from the reports of 430 stores. The stores are divided into ten groups according to size as measured by annual sales.
The purpose of making this classification is to group together stores operating under conditions that are as nearly similar as possible.

¹ Commercial and Financial Chronicle, Vol. 150, No. 3894 (February 10, 1940), p. 904 (New York: William B. Dana Co.), reproduced from a National Industrial Conference Board Report.

TABLE 2
TURNOVER OF GOODS IN DEPARTMENT STORES OF DIFFERENT SIZES IN THE UNITED STATES, 1938 *

SIZE OF STORE AS MEASURED     NUMBER OF STORES     AVERAGE TURNOVER
BY ANNUAL SALES               REPORTING            OF GOODS
Less than $150,000                   54                  2.1
150,000-300,000                      45                  2.7
300,000-500,000                      58                  3.6
500,000-750,000                      35                  3.6
750,000-1,000,000                    28                  4.2
1,000,000-2,000,000                  62                  4.2
2,000,000-4,000,000                  58                  4.4
4,000,000-10,000,000                 57                  4.7
10,000,000-20,000,000                20                  4.7
20,000,000 or more                   13                  3.4

* Malcolm P. McNair, "Operating Results of Department and Specialty Stores in 1938," Bureau of Business Research Bulletin Number 109 (May, 1939), Boston: Graduate School of Business Administration, Harvard University.

The turnover is computed by dividing the annual sales by the average inventory. The increase in the turnover, as size of store increases, indicates that the smaller stores maintain larger inventories in relation to sales than the larger stores. This skeleton fact gives rise to many questions related to the analysis of department-store operations. For example, one might theorize as follows: the smaller stores must keep in stock a line of goods practically as inclusive as that maintained by larger stores; however, in smaller stores, demand for many types of goods is only occasional, whereas those same goods are in constant demand in larger stores; consequently the maintenance of this slow-moving stock reduces the turnover of the smaller stores. The testing of this hypothesis would be a task for the statistical staff that has access to the reports of the individual concerns.
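The turnover computation described above is simple enough to sketch in a few lines. The following Python fragment is only an illustration: the single store's sales and inventory figures are hypothetical, and the helper names are invented for the purpose. The rates in the list are those printed in Table 2, and the reciprocal of each rate is taken as a rough measure of the inventory carried per dollar of annual sales.

```python
# Turnover as defined in the text: annual sales divided by average inventory.
def turnover(annual_sales, average_inventory):
    return annual_sales / average_inventory


# Reciprocal of the turnover rate: average inventory carried per dollar of sales.
def inventory_per_dollar_of_sales(rate):
    return 1 / rate


# Hypothetical store; these two figures are invented for illustration only.
example_rate = turnover(400_000, 111_000)

# Size group, number of stores reporting, and average turnover, as printed in Table 2.
TABLE_2 = [
    ("Less than $150,000", 54, 2.1),
    ("150,000-300,000", 45, 2.7),
    ("300,000-500,000", 58, 3.6),
    ("500,000-750,000", 35, 3.6),
    ("750,000-1,000,000", 28, 4.2),
    ("1,000,000-2,000,000", 62, 4.2),
    ("2,000,000-4,000,000", 58, 4.4),
    ("4,000,000-10,000,000", 57, 4.7),
    ("10,000,000-20,000,000", 20, 4.7),
    ("20,000,000 or more", 13, 3.4),
]

total_stores = sum(stores for _, stores, _ in TABLE_2)

print(f"hypothetical store turnover: {example_rate:.1f}")
print(f"stores reporting: {total_stores}")
for label, stores, rate in TABLE_2:
    share = inventory_per_dollar_of_sales(rate)
    print(f"{label:<24}{stores:>4}{rate:>6.1f}{share:>7.2f}")
```

Read this way, the smallest stores carry roughly 48 cents of inventory for each dollar of annual sales, against about 21 cents for the groups with a turnover of 4.7. The largest size group, with a turnover of 3.4, departs from the general pattern and would itself be a subject for further inquiry.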
Further examples of the type of research undertaken by statisticians working with the records of individual concerns are: the relation of advertising costs to sales; the seasonal variation in automobile sales in different parts of the country; the relation of bank loans to size of banks and population of cities in which the banks are located; and the rates of interest charged for installment credit according to type of goods purchased. In other cases the data used in research work do not come from business concerns but from markets, individuals, or the results of prior statistical work. Some illustrations are the construction of an index of the general price level, a study of the preferences of consumers for competing products, and an analysis of the relation of the alternations of prosperity and depression to the sales of consumers' goods and producers' goods.

The studies mentioned in preceding paragraphs give some indication of the difference in type of problem encountered by statisticians working for individual concerns and by those who engage in business research in some other capacity. Specific techniques that are important in one type of work may be less so in the other, but a common body of method is used in either case. Since the primary purpose of this book is to present a systematic development of statistical method, no occasion will arise for keeping the two types of statistical problems separate. Examples will be drawn from either to illustrate the discussion of methods employed in both.

Statistical Background of Business Activity

The extent to which business operations depend upon a background provided by statisticians is not generally realized. This statement is equally applicable to every technical specialist who is a part of the business structure, but the obscurity of the statistician's contribution is particularly striking because of the wide ramifications of his work.
The nature of the work of the statistician can be explained with the aid of some published statements concerning business affairs.

Example 1. "Wages [in 1938]," Swift and Company Year Book (1938), p. 26:

Since 1923 the average hourly wage rate for Swift & Company's Chicago plant workers has increased by 52 per cent, while the number of hours in the basic working week has been reduced from 48 to 40. Actual weekly earnings per worker are about 37 per cent greater than in 1923. Taking into account the changes in living costs, these weekly earnings provide Swift & Company plant workers with approximately 57 per cent higher "real" wages than they received in 1923.

The statistical department of Swift and Company presumably maintains employment records containing average hourly wages, the number of hours worked per week, and the average weekly earnings per worker in 1923 and in 1938. The computation of the per cents of increase is, of course, routine work. Indexes of the cost of living are published by the United States Bureau of Labor Statistics and the National Industrial Conference Board. A comparison of weekly wages of Swift and Company employees with a cost of living index gives the increase in real wages.

Example 2. "Your Food Supplies and Costs," Consumers' Guide, Vol. V, No. 10 (October 24, 1938), p. 16:

EGGS. Supplies are expected to continue smaller than a year earlier during the remainder of 1938, but in 1939 supplies probably will be bigger than in the current year. Relatively small stocks of storage eggs, coupled with smaller fresh egg production than in 1937 have been the major factors behind the larger than usual price upswing this year. Storage stocks are an important source of supply during the last quarter of the year when fresh egg production reaches its lowest level. Current storage stocks are almost a third under a year ago.
Continuation of the present rate of increase in prices would result in peak egg prices in November considerably above their 1937 level and might result in the highest prices since 1930. There is some possibility, however, that fresh egg production may comprise a larger than usual proportion of total supplies during the last two months of the year because of the large hatchings this spring. This condition would offset part of the price boosting effect of small storage stocks. Retail egg prices went up 5 cents a dozen from August to September and were a cent a dozen higher than last September.

The statistical background for this analysis of egg prices has been supplied by the United States Department of Agriculture. Local offices of the department in all parts of the country send regular reports to Washington concerning conditions in their areas. The analysis of these reports by the statistical division provides information concerning the supply of eggs for the latter part of 1938 and early 1939, the size of cold storage holdings, the prices of eggs during the year, and the prospective supply of egg-laying pullets. Previous studies of the department afford a basis for the statement that "storage stocks are an important source of supply during the last quarter of the year when fresh egg production reaches its lowest level." Comparisons of current reports with department records show absolute and relative price changes from earlier months as well as earlier years.

Example 3. Buffalo Evening News (November 22, 1938), p. 29:

YULE SALES MAY EQUAL $1,200,000,000 OF 1937

NEW YORK, Nov. 22 (AP). A busy Christmas shopping season was foreseen today by the National Retail Dry Goods Association. An analysis by its accounting experts, the association reported, indicated dollar sales in department and apparel specialty stores of the nation in the four weeks preceding Christmas may approximate $1,200,000,000, about the same as in the comparable 1937 period.
Actually, the number of items traded across store counters, it was pointed out, may exceed last year's Christmas trade because department store prices this year are about 7 per cent lower on the average.

As stated in the article the accountants have estimated that sales during the Christmas season of 1938 will be very satisfactory, but the main point is the work of the statistician which is back of the innocent-looking statement that prices of department store goods are about 7 per cent lower than they were in 1937. This conclusion is probably based on the "Index of Prices of Department Store Goods" prepared monthly by A. W. Zelomek and published in Fairchild Publications. This index is based on prices of 105 nonstyle items collected monthly from 53 retail-trade organizations.

Example 4. "The Trend of Business," Dun's Review, Vol. 46, No. 2121 (May, 1938), pp. 30-31:

On the charts, the present state of business activity bears some resemblance to that of 1934. A few of the more important measures (national income, department store sales, wholesale prices, construction contracts) are still above early 1936 levels, considerably higher than in 1934. On the other hand, industrial production is down to the 1934 average; primary distribution, measured by railroad carloadings, is the lowest since November, 1934; the Annalist index of business activity for March, the lowest since November, 1934; the Times average of 50 stock prices for the first three weeks of April, the lowest since September, 1934.

The first sentence indicates that charts have been prepared by statisticians showing the course followed by various indicators of business conditions in recent years. The computation of the national income requires the continuous attention of a corps of statisticians in the United States Department of Commerce.
Department-store sales are reported by over 400 individual stores to the Federal Reserve Banks of the districts in which the stores are located. Indexes of sales are prepared for each district as well as for the United States as a whole. Wholesale price indexes are prepared by a number of statistical agencies, but the most widely used index is that of the United States Bureau of Labor Statistics computed by an elaborate technique and based on prices of over 800 commodities. Data on construction contracts are collected by the F. W. Dodge Corporation through local offices and correspondents in 37 states east of the Rocky Mountains. An Index of Industrial Production is published by the Board of Governors of the Federal Reserve System. The research staff prepares the index by the application of elaborate statistical techniques to data compiled from trade journals, reports of trade associations and government bureaus. Railroad car loadings are collected from individual railroad companies and prepared for publication by the Car Service Division of the Association of American Railroads. The Annalist index of business activity is a cyclical index corrected for trend and seasonal variation by an involved statistical process. The New York Times average of 50 stocks is a product of the newspaper's research staff.

Articles similar to these are presented every day to the reading public and they exert a widespread influence over the conduct of business affairs. These four examples give some indication of the variety of the activities of business statisticians and of the multiplicity of methods and techniques they employ. The orderly development of basic methods and techniques and their relation to various business activities become the subject matter of a textbook in statistics.

PROBLEMS

1. What distinguishes statistical data from abstract numbers?

2.
Apply this distinction to the following; give reasons for your answer in each case:

A
NUMBERS     SQUARES     SQUARE ROOTS     RECIPROCALS
   51        2601          7.1414          .019608
   52        2704          7.2111          .019231
   53        2809          7.2801          .018868
   54        2916          7.3485          .018519
   55        3025          7.4162          .018182

B
REPORT OF OVERTIME WORKED BY LOCAL BRANCHES OF A LABOR UNION
AMOUNT OF OVERTIME                      NO. OF LOCALS
None                                          2
Occasionally                                  3
Never more than 6 hours per week              1
When necessary                                3
Five hours regularly                          2
Total                                        13

C
ADDITIONS TO TERRITORY OF CONTINENTAL UNITED STATES AFTER 1783
TERRITORY                DATE OF ADDITION
Northwest Territory           1787
Louisiana Purchase            1803
Florida                       1819
Texas                         1845
Oregon                        1846
Mexican Cession               1848
Gadsden Purchase              1853

D
HOURLY UNSKILLED HIRING WAGE RATE OF A GROUP OF MANUFACTURING CONCERNS IN 1936
CONCERN     HOURLY WAGE RATE (in Cents)
A                  32
B                  36
C                  30
D                  40
E                  35

3. What are the differences between the study of general statistics and business statistics?

4. Why is statistics not listed as a division of business activity?

5. List the differences in function of the statistician employed by a private concern and one employed by some other type of organization.

6. In the preparation of which of the following reports would the statistician initially compiling the data be employed by an industrial concern?
a) Monthly production of automobiles and trucks by General Motors.
b) Weekly freight car loadings of coal in the United States.
c) Daily bank clearings of the Buffalo Clearing House Association.
d) The monthly production of crude oil in the United States.
e) The net profits of the Erie Railroad for the first six months of 1930.
f) The daily messages carried by the New York Telephone Company.
g) The number of airplanes arriving and departing at the Buffalo airport.
h) Number of bank employees:

NAME OF BANK        EXECUTIVES     CLERKS
Export                  23           180
First National          14           200
etc.
Total                  212          1325

7. Describe the statistical material found on the financial pages of an urban newspaper.
Be sure to give exact reference to the issue and edition of the paper.

8. Select from a current publication an article similar to the examples in the text. State what work has been done by statisticians in the preparation of the article. Give exact reference.

REFERENCES

BROWN, THEODORE H., "Problems Met by Companies That Instruct Their Employees in Statistical Methodology," Journal of the American Statistical Association, Vol. XXVII, No. 181A (March, 1933, Supplement), pp. 10-14.
A suggested program of statistical work that could be carried on within a business concern.

BURGESS, ROBERT W., "The Whole Duty of the Statistical Forecaster," Journal of the American Statistical Association, Vol. XXVII, No. 181A (March, 1933, Supplement), pp. 636-42.
The first part of this article includes some excellent examples of the types of analysis made by statisticians.

FALKNER, ROLAND P., "The Scope of Business Statistics," Quarterly Publications of the American Statistical Association, Vol. XVI, No. 122 (June, 1918), pp. 24-29.
An earlier attempt to define business statistics that is still pertinent.

HATHAWAY, WILLIAM A., "Internal and External Statistical Needs of American Business," Quarterly Publications of the American Statistical Association, Vol. XVI, No. 122 (June, 1918), pp. 1-15.
Contains background essential to an understanding of modern statistical development.

PARMALEE, JULIUS H., "The Utilization of Statistics in Business," Quarterly Publications of the American Statistical Association, Vol. XV, No. 117 (June, 1917), pp. 565-76.
An early statement of the types of statistical analysis available to business men.

YOUNG, BENJAMIN A., Statistics as Applied in Business. New York: The Ronald Press Company, 1925.
Chapters I to VIII contain a very complete statement of the character of internal and external statistical work and a detailed presentation of types of statistical problems.
CHAPTER II

THE USE OF NUMBERS

INTRODUCTION

In business practice there is an increasing trend toward the expression of ideas in numerical form. The manager of a store no longer reports that business is improving, but that sales last month were 16 per cent better than in the corresponding month of last year. The banker no longer relies solely upon his personal judgment in granting loans, but uses a set of ratios, derived from the financial statement of a prospective borrower, to aid him in determining the concern's credit standing. A similar tendency toward more precise methods can be found in all parts of the business structure. In no small degree this tendency accounts for the increasing demand for a knowledge of statistical methods.

On the other hand teachers in various parts of the country have remarked that young men and women of college age show a decline in ability to carry out numerical operations and particularly an increasing inability to think in numerical terms. It is not the function of a textbook in statistics to reverse this tendency. The fault is too fundamental for that. This condition does explain, however, why it is desirable to pause for a brief statement concerning methods of computation before proceeding with the development of statistical techniques. The necessary computation which accompanies statistical work consumes a vast amount of time. The greater part of such computation is purely repetitive in character; consequently methods of shortening the time spent in doing it will allow more of the student's time to be spent in studying statistics and less in practicing arithmetic. Hence the following pages are devoted to a review of arithmetic operations.

THE FUNDAMENTAL OPERATIONS

Addition

Computations should be performed rapidly. The advantage in speed lies not merely in the time saved, but mainly in the confidence gained for those who waver and in the attention preserved for those whose minds might wander.
There is some advantage here in illustrating the wrong method. Suppose that the following columns are to be added:

 2641
  362
  570
 1369
 8147
 4216
 2164

All too commonly in the author's experience the student's mind goes through the following steps: adding up from the bottom, 4 and 6 are 10, 10 and 7 are 17, 17 and 9 are (then 9 are told off on the fingers) are 26, 26 and 2 are (I think I'll go to lunch after this class). Let me see, I was adding something. Oh yes, 4 and 6 are 10, etc. To eliminate this wandering the addition should be done at the maximum speed possible, naming only the successive sums. So, first column, 10, 17, 26, 29; second column, 9, 19, 26, 36; third column, 6, 10, 15, 24; fourth column, 8, 16, 19. There is everything to be gained and nothing to be lost by performing addition at a rate of speed which will leave no time to worry about an impending lunch hour. There are students who can be busily engaged for as much as 2 minutes in adding these columns of figures, whereas 15 seconds is the maximum time which should be spent.

Those who have difficulty with the amount to be carried from one column to the next may prefer to write the total of each column separately as indicated below. This method is advantageous if one is likely to be interrupted.

 2641
  362
  570
 1369
 8147
 4216
 2164
   29
  34
 21
17
19469

Subtraction

The results of subtraction should always be checked by adding the subtrahend and the remainder.

      38642 minuend
(-)   12966 subtrahend
      25676 remainder
(+)   12966 subtrahend
      38642 check

Subtraction of numbers can be performed mentally by using a method of excess and deficit in relation to a round number such as 100. In subtracting 69 from 118, the minuend is 100 + 18, the subtrahend is 100 - 31; therefore the difference is 18 + 31 = 49.
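The excess-and-deficit method just illustrated can be stated as a small routine. The following Python sketch is offered only as an illustration; the function name and the choice of round numbers are assumptions made for the purpose, and the check by addition follows the rule given at the start of this section.

```python
def subtract_by_round_number(minuend, subtrahend, round_number):
    """Subtract by measuring both numbers against a convenient round number."""
    excess = minuend - round_number       # amount by which the minuend exceeds the round number
    deficit = round_number - subtrahend   # amount by which the subtrahend falls short of it
    remainder = excess + deficit
    # Check the result by adding the subtrahend and the remainder.
    assert subtrahend + remainder == minuend
    return remainder


# 118 = 100 + 18 and 69 = 100 - 31, so the difference is 18 + 31.
print(subtract_by_round_number(118, 69, 100))    # 49
print(subtract_by_round_number(263, 186, 200))   # 77
```

The round number need not lie between the two figures; any convenient value gives the same remainder, since the excess and the deficit are simply added with their signs.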
Other similar examples are:

263 = 200 + 63           547 = 500 + 47
186 = 200 - 14           490 = 500 - 10
       77 remainder             57 remainder

1245 = 1000 + 245        2317 = 2200 + 117
 893 = 1000 - 107        2049 = 2200 - 151
        352 remainder            268 remainder

After this method has been mastered, it will not be necessary to put any of the computation on paper. With sufficient practice the method can be used in fairly complicated operations. For example,

12286 = 12000 + 286               4260 = 4000 + 260
 1143 =  1000 + 143               2749 = 3000 - 251
subtracting, 11000 + 143 = 11143  subtracting, 1000 + 511 = 1511

Multiplication

The greatest saving of time in multiplication results from the use of short cuts. These are derived from well-known principles of arithmetic and algebra as indicated in the examples of various methods which follow.

The Use of Reciprocals.¹ One number multiplied by another is the same as the first number divided by the reciprocal of the second.

1. 763 × 5 = 763 ÷ .2 = 7630 ÷ 2 = 3815
2. 1582 × 25 = 1582 ÷ .04 = 158200 ÷ 4 = 39550
3. 220 × 50 = 220 ÷ .02 = 22000 ÷ 2 = 11000
4. 17228 × 125 = 17228 ÷ .008 = 17228000 ÷ 8 = 2153500
5. 15415 × .16 2/3 = 15415 ÷ 6 = 2569.17

¹ The reciprocal of a number is defined as unity divided by the number, i.e., the reciprocal of 5 is 1 ÷ 5 = .2. The reciprocal of 40 is 1 ÷ 40 = .025. The reciprocal of .25 is 1 ÷ .25 = 4.

Multiplier Near Ten or a Power of Ten.

6. 27 × 99 = 27(100 - 1) = 27(100) - 27(1) = 2700 - 27 = 2673
7. 366 × 1001 = 366000 + 366 = 366366
8. 2746 × 11 = 27460 + 2746 = 30206

Squaring Numbers Ending in Five.²

9. 25² = (2 × 3)100 + 25 = 625, i.e., 2(2 + 1) and annex 25
10. 75² = (7 × 8)100 + 25 = 5625, i.e., 7(7 + 1) and annex 25
11. 105² = (10 × 11)100 + 25 = 11025, i.e., 10(10 + 1) and annex 25
12. 405² = (40 × 41)100 + 25 = 164025, i.e., 40(40 + 1) and annex 25

Last Digits Totaling Ten.³

13. 67 × 73 = (70 - 3)(70 + 3) = 4900 - 9 = 4891
14. 95 × 105 = (100 - 5)(100 + 5) = 10000 - 25 = 9975
15. 89 × 111 = (100 - 11)(100 + 11) = 10000 - 121 = 9879
16. 4.1 × 2.9 = (3.5 + .6)(3.5 - .6) = 12.25 - .36 = 11.89
17.
640 × 5.6 = 100(6.4 × 5.6) = 100(6 + .4)(6 - .4) = 100(36 - .16) = 100 × 35.84 = 3584

A Method of Squaring Any Number.⁴

18. 72² = (72 - 2)(72 + 2) + 2² = (70 × 74) + 4 = 5180 + 4 = 5184
19. 153² = (153 - 3)(153 + 3) + 3² = (150 × 156) + 9 = (100 × 156) + (50 × 156) + 9 = 15600 + 7800 + 9 = 23409

Division

There are a few worthwhile short cuts in division, based mainly on the use of reciprocals.⁵ Commonly used among these are,

20. 5725 ÷ 25 = 5725 × .04 = 57.25 × 4 = 229
21. 280400 ÷ 50 = 2804 × 2 = 5608
22. 12750 ÷ 500 = 12.75 × 2 = 25.5
23. 245925 ÷ 125 = 245.925 × 8 = 1967.4

It is to be expected that the reader who employs the few short cuts listed here will develop many more to aid him as he progresses in the use of numbers. There is no standard set which can be recommended for the use of everyone. Each person should employ those which result in time saving for him and come to mind naturally. Just as every person has an individual style in writing so every person will develop an individual set of short cuts in computation.

² To square any number ending in 5 multiply the part of the number to the left of 5 by one more than itself and annex 25 to the product.
³ Examples 13 to 17 make use of the algebraic identity (a + b)(a - b) = a² - b².
⁴ The same algebraic identity is used as in the preceding examples but the form is changed to a² = (a + b)(a - b) + b².
⁵ The use of reciprocals changes division to multiplication whereas the use of reciprocals on a preceding page changed multiplication to division.

The Order of Performing the Fundamental Operations

When the operations of addition and subtraction are employed in a problem, the order of performing them makes no difference in the result. Thus:

50 + 275 - 36 + 5 - 210 = 84
or 275 - 36 + 50 - 210 + 5 = 84
or -210 + 50 - 36 + 275 + 5 = 84

The introduction of parentheses into such a series indicates that the operations within the parentheses must be performed first.
There will be no difference in the result when the sign preceding a parenthesis is plus, but when it is minus, it has the effect of reversing the sign of every figure inclosed. Thus:

69 - 63 + 58 - 10 = 54
69 - (63 + 58) - 10 = 69 - 121 - 10 = -62
69 - (63 + 58 - 10) = 69 - 111 = -42
but 69 - 63 + (58 - 10) = 69 - 63 + 48 = 54

When the operations of multiplication and division are employed in a problem, the order of performing the division does alter the result; hence in order to avoid ambiguity, it is necessary to inclose in parentheses the figures that are intended to be used together as numerator and denominator. Thus:

(250 ÷ 10) × 2 = 25 × 2 = 50
but 250 ÷ (10 × 2) = 250 ÷ 20 = 12.5

If several signs of grouping are used in the same problem, the rule is "Work from the inside out." Thus:

-5[{26(36 ÷ 9)} ÷ 52] = -5[{26 × 4} ÷ 52] = -5[104 ÷ 52] = -5 × 2 = -10

When multiplication or division or both appear in a problem along with addition or subtraction or both, with no parentheses, the operations of multiplication and division must be performed first. If parentheses are introduced, the rules already quoted will apply. Thus:

550 + 10 × 7 - 60 ÷ 5 = 550 + 70 - 12 = 608
(550 + 10) × (7 - 60) ÷ 5 = 560 × (-53) ÷ 5 = -29680 ÷ 5 = -5936
(550 + 10) × [7 - (60 ÷ 5)] = 560 × [7 - 12] = 560 × (-5) = -2800

FRACTIONS

Sometimes computations are carried on in common fractions and sometimes in decimals. It is desirable therefore to know how to perform operations with both and how to convert one into its equivalent in the other.

Common Fractions

Addition and Subtraction. To add 1/2, 1/3, and 1/5, the common denominator must first be found. The common denominator is the smallest number divisible by the individual denominators, in this case 2, 3, and 5. By inspection this is 30; there is no smaller number divisible by 2, 3, and 5. The three fractions with the common denominator 30 are 15/30 + 10/30 + 6/30 = 31/30 or 1 1/30.
Suppose that the four fractions 3 [?]/36, 5 [?]/15, 2 [?]/20, and 8 [?]/9 were to be added. When, as in this case, the common denominator is not evident by inspection, the general method of finding it is to reduce all of these individual denominators to their prime factors and take the product of the prime factors appearing in the reduction. The form for finding the common denominator of the given fractions is:

Divisors     Denominators
   2         36   15   20    9
   2         18   15   10    9
   3          9   15    5    9
   3          3    5    5    3
   5          1    5    5    1
              1    1    1    1

The process consists in dividing by 2 as long as any of the denominators are divisible by 2. Then do the same with 3 and so on using only prime⁶ numbers as divisors until unity is reached in each column. When any denominator is not divisible in any row, it is simply carried along until a divisor is used by which that denominator is divisible. The common denominator will be the product of the prime divisors on the left, i.e., 2 × 2 × 3 × 3 × 5 = 180. The four fractions expressed in terms of the common denominator are, 3 [?]/180 + 5 [?]/180 + 2 [?]/180 + 8 [?]/180 = [illegible].

Subtraction of fractions is performed by the same process of reducing to a common denominator. For example, in subtracting 6 [?] from 13 [?] both fractions should be changed to the common denominator 90. Then 13 [?]/90 - 6 [?]/90 = 12 [?]/90 - 6 [?]/90 = 6 [?]/90 = 6 [?].

Multiplication. The type of multiplication problem most common in statistical work involving common fractions is similar to that met by the bookkeeper in extending bills or inventories, e.g., 4 5/6 dozen shirts at $16 1/2 per dozen. Reduce each number to an improper fraction: 4 5/6 = 29/6 and 16 1/2 = 33/2; then the total value of the shirts would be

29/6 × 33/2 = (29 × 33)/(6 × 2) = (29 × 11)/(2 × 2) = 319/4 = $79 3/4.

Division. The rule for finding the value of the quotient of two fractions is: invert the fraction in the denominator and multiply. Some examples are: [illegible].

Decimal Fractions

The preceding paragraphs have dealt with problems containing common fractions.
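The operations on common fractions reviewed above lend themselves to a mechanical check. The following Python sketch is an illustration only: it uses the standard `fractions` module, the function name `common_denominator` is invented here, and apart from the denominators 36, 15, 20, and 9 and the denominators 2, 3, and 5, the fractions are chosen simply for illustration. The routine follows the tabular procedure of dividing by successive primes and carrying along any denominator that is not divisible.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def common_denominator(denominators):
    """Smallest number divisible by every denominator, found by repeated
    division by primes in the manner of the tabular form."""
    remaining = list(denominators)
    product = 1
    divisor = 2
    while any(d != 1 for d in remaining):
        if any(d % divisor == 0 for d in remaining):
            # Divide each divisible denominator; carry the others along unchanged.
            remaining = [d // divisor if d % divisor == 0 else d for d in remaining]
            product *= divisor
        else:
            # Move to the next trial divisor. A composite divisor never divides
            # at this point, because its prime factors are already exhausted.
            divisor += 1
    return product


# Denominators from the tabular form: 2 x 2 x 3 x 3 x 5 = 180.
lcd = common_denominator([36, 15, 20, 9])

# Independent check using the greatest common divisor.
lcd_check = reduce(lambda a, b: a * b // gcd(a, b), [36, 15, 20, 9])

# Addition over a common denominator (unit fractions with denominators 2, 3, 5).
total = Fraction(1, 2) + Fraction(1, 3) + Fraction(1, 5)

# Division: invert the divisor and multiply (illustrative values).
quotient = Fraction(3, 4) / Fraction(2, 5)
inverted = Fraction(3, 4) * Fraction(5, 2)

print(lcd, lcd_check, total, quotient, inverted)
```

The `Fraction` type reduces every result to lowest terms automatically, so it serves as a check on hand computation rather than as a substitute for understanding the steps.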
It is equally necessary to be conversant with methods of dealing with decimal fractions. In fact decimals are more frequently employed in statistical work than common fractions. The use of calculating machines requires the expression of fractional amounts decimally. It is necessary therefore that statisticians acquire facility in handling both common and decimal fractions and be able to convert one to the other automatically. To convert a common fraction to a decimal the numerator is divided by the denominator, i.e.,

1/5 = 1.0 ÷ 5 = .2
3/8 = 3.000 ÷ 8 = .375
5/140 = 5.000 ÷ 140 = .0357

⁶ A prime number is one that is divisible only by unity and the number itself.

Decimal Fractions and Per Cents. When any number is expressed as a decimal or a per cent, it simply means that the numerator of a common fraction is written, the denominator being understood without writing it. Decimals mean a certain part of one unit, while per cents mean a certain part of 100 units. Thus, .5 means five-tenths of one unit, or one-half, while 50 per cent means 50 of 100 units. Obviously 1/2 of every one and 50 of every hundred are merely two ways of expressing the same relation; hence we say that .5 is equivalent to 50 per cent. The rule is: To express a decimal as a per cent move the decimal point two places to the right. The reverse rule is: To express a per cent as a decimal move the decimal point two places to the left.

It is important for the statistician to be able to change from common fractions or decimals to per cents and the reverse with accuracy and without consuming much time in the process. Table 3 gives a list of equivalents that can be referred to until their use becomes automatic.

TABLE 3
LIST OF COMMON FRACTIONS WITH THEIR DECIMAL AND PER CENT EQUIVALENTS

COMMON     EQUIVALENT    EQUIVALENT     COMMON     EQUIVALENT    EQUIVALENT
FRACTION   DECIMAL       PER CENT       FRACTION   DECIMAL       PER CENT
1/1000     .001          .1             5/8        .625          62.5
1/500      .002          .2             7/8        .875          87.5
1/400      .0025         .25            1/6        .166 2/3      16.66 2/3
1/300      .00333 1/3    .33 1/3        5/6        .833 1/3      83.33 1/3
1/250      .004          .4             1/5        .2            20.
1/200      .005          .5             2/5        .4            40.
1/160      .00625        .625           3/5        .6            60.
3/400      .0075         .75            4/5        .8            80.
1/100      .01           1.             1/4        .25           25.
3/200      .015          1.5            3/4        .75           75.
1/50       .02           2.             1/3        .33 1/3       33.33 1/3
1/40       .025          2.5            2/3        .66 2/3       66.66 2/3
3/100      .03           3.             1/2        .5            50.
1/30       .0333 1/3     3.33 1/3       1 1/2      1.5           150.
1/25       .04           4.             1 1/3      1.33 1/3      133.33 1/3
1/20       .05           5.             1 1/4      1.25          125.
1/16       .0625         6.25           1 3/4      1.75          175.
1/15       .066 2/3      6.66 2/3       2 1/5      2.2           220.
1/12       .0833 1/3     8.33 1/3       3 5/8      3.625         362.5
5/12       .4166 2/3     41.66 2/3      4 7/8      4.875         487.5
7/12       .5833 1/3     58.33 1/3      5 5/6      5.833 1/3     583.33 1/3
11/12      .9166 2/3     91.66 2/3      8 1/10     8.1           810.
1/8        .125          12.5           10 5/16    10.3125       1031.25
3/8        .375          37.5           12 7/20    12.35         1235.

Calculation of Per Cents. The three terms of a per cent calculation are: (1) the base, b, (2) the rate, r, and (3) the percentage, p. The fundamental relation is b × r = p. Given any two of these terms the third one can be found from the fundamental relation. There are, therefore, three types of problems which arise. Each of these is illustrated in the examples which follow:

Example 1: How much is 5 per cent of 12420?
This means that 5 of every hundred in 124.20 hundreds are to be counted, so 5 × 124.20 = 621, hence 5 per cent of 12420 is 621. The simpler way of doing the same thing is to multiply the given number, 12420, by .05 = 621. That is, instead of taking 5 out of every hundred in the original number, take .05 out of every one in the original number.

Example 2: How much is 364 per cent of 1250?
1250 × 3.64 = 4550.

Example 3: How much is 750 increased by 40 per cent of itself?
750 + (750 × .4) = 750 + 300 = 1050, or 750 × 1.4 = 1050.

Example 4: How much is 2/5 of 4875? 2/5 per cent of 4875?
4875 × .4 = 1950.
4875 × .004 = 19.50.

p ÷ r = b

Example 5: 450 is 75 per cent of what number?
If 75 per cent of a certain number is 450, then 1 per cent of the number is 1/75 of 450 or 6. If 1 per cent of a number is 6, then 100 per cent of the number is 100 times 6, or 600.
Therefore the number is 600. The work can be shortened as follows: 450 ÷ .75 = 600.

Example 6: 375 is 12 1/2 per cent of what number? 375 ÷ .125 = 3000. Another solution would use the 12 1/2 per cent as 1/8. The problem would then read, 375 is 1/8 of what number? The number is 375 × 8 = 3000.

Example 7: 12500 is 4/3 of what number? If 12500 is 4/3 of the number, then 1/3 of the number would be 1/4 of 12500 or 3125. If 1/3 of the number is 3125, then 3/3 or 100 per cent of the number would be 3 times 3125, or 9375. Therefore the number is 9375. The pencil and paper solution would be, 12500 ÷ 1.3333 = 9375 plus a remainder. This remainder is, of course, due to the use of an approximate divisor. The advantage of the common fraction solution is obvious.

Example 8: What per cent of 24 is 3? This problem may be worked in two ways: a) 3 is 1/8 of 24, or 12 1/2 per cent of 24; b) 3 ÷ 24 = .125, or 12 1/2 per cent.

Example 9: What per cent of 8100 is 17415? 17415 ÷ 8100 = 2.15, or 215 per cent.

Although the wording of a problem may somewhat obscure the case, all per cent calculations can be expressed in one of these three forms. With sufficient practice in dealing with per cent problems no difficulty should be encountered in determining which of the three forms is to be used.

SQUARE ROOTS

The five commonly used methods of determining square roots of numbers are (a) by inspection, (b) by arithmetic, (c) by the use of logarithms, (d) by the use of a table of square roots, and (e) by the use of a slide rule. Only the first and second methods will be discussed at this time. The use of logarithms is explained in Appendix C. A table of square roots has been provided in Appendix D. The use of a slide rule can be learned best from the manual of instructions provided by the manufacturers of slide rules.
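Before the individual square root methods are taken up, the three forms of the per cent relation from the preceding section (b × r = p, p ÷ r = b, and p ÷ b = r) can be checked with a short Python sketch. It is an illustration only and not part of the original text: the function names are hypothetical, and exact fractions are used (an assumption made here) so that a rate such as 4/3 leaves no remainder of the kind noted in Example 7.

```python
from fractions import Fraction


def per_cent(n):
    """Express n per cent as a rate, i.e. n hundredths (hypothetical helper)."""
    return Fraction(n) / 100


def percentage(base, rate):
    """First form: b x r = p."""
    return Fraction(base) * Fraction(rate)


def base_of(amount, rate):
    """Second form: p / r = b."""
    return Fraction(amount) / Fraction(rate)


def rate_of(amount, base):
    """Third form: p / b = r."""
    return Fraction(amount) / Fraction(base)


# Examples 1, 5, 7, 8 and 9 from the text.
print(percentage(12420, per_cent(5)))   # 621
print(base_of(450, per_cent(75)))       # 600
print(base_of(12500, Fraction(4, 3)))   # 9375
print(rate_of(3, 24) * 100)             # 25/2, i.e. 12 1/2 per cent
print(rate_of(17415, 8100) * 100)       # 215
```

Working in common fractions here mirrors the remark in Example 7 that the common fraction solution avoids an approximate divisor.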
By Inspection

The approximate value of the square roots of many numbers can be ascertained by a process of mental interpolation, if one has at command the values of the square roots of a few numbers or makes use of the short-cut method of squaring numbers ending in five. The inspection method can be explained easily by the use of an example. Suppose the square root of 457 were wanted. Twenty squared is 400 and twenty-five squared is 625. The square root of 457 is somewhere between 20 and 25, but it is obviously closer to 20. The difference between 400 and 625 is 225; 57 is approximately one-fourth of this amount, therefore the square root of 457 is approximately 20 + 1/4(5) or 21.25. The correct value is 21.38, hence the value by inspection is not even correct to one decimal place.

A variation of the method used in the preceding example will yield better results if a calculating machine is available. Suppose the square root of 12750 were wanted. The square of 105 is 11025, the square of 110 is 12100, and the square of 115 is 13225. The square root of 12750 is between 110 and 115 but is a little closer to 115. Therefore square 113 on the calculating machine, securing 12769. This is rather close, but the next step, if more accuracy is wanted, is to square 112.9. The result, 12746, is still closer and further trials will give 112.92 as the correct root to five figures.

The inspection method yields good results quickly after a little practice. Even though it is not used as a method of finding the square root, it is valuable as a checking device when other methods are employed. This is particularly true when roots are found by logarithms, the slide rule, or a square root table.

By Arithmetic Computation

When no auxiliary devices are available and accurate results are required, square roots can be found by the following steps.

Step 1. Divide the number into groups of two digits each way from the decimal point.
The last group on the left will contain only one digit if the number has an odd number of digits to the left of the decimal point.

Step 2. The largest number whose square does not exceed the value of the digit or pair of digits in the left-hand group of the number is the first figure of the root. This figure is entered above the left-hand group.

Step 3. Subtract the square of the first figure of the root from the left group of the number.

Step 4. At the right of the remainder of Step 3, annex the figures in the second group of the number. This is the new dividend.

Step 5. Double the root already found and annex one zero to the right as a trial divisor and divide it into the dividend of Step 4 to obtain the second figure of the root, which is entered over the second pair of digits. This new figure will often be too high and must be corrected by trial and error.

Step 6. The new figure is added to the trial divisor of Step 5 to give the true divisor, which is then multiplied by the new figure of the root to give a product which is entered under the dividend of Step 4.

Step 7. This product is subtracted from the dividend, the next group of digits is annexed on the right of the remainder, and the process of Step 5 and Step 6 is repeated.

Two examples of the use of these seven steps in extracting the square root are shown in Figure 1. The examples are constructed to emphasize the growth of the solution as the successive steps of the process are applied to the examples. The complete solution for future use is given in the last computation at the bottom of the Figure.

FIGURE 1
TWO EXAMPLES OF THE DEVELOPMENT BY SUCCESSIVE STEPS OF THE SOLUTION FOR EXTRACTING SQUARE ROOT

[Worked layouts for the square root of 12750 (left) and the square root of 4693.49 (right), built up through Steps 1 to 7, with Steps 5, 6 and 7 repeated for each further figure of the root.]

* In the second example the trial figure 9 gives the product 129 × 9 = 1161, which exceeds the dividend 1093. This product is too large, hence the new figure of the root should be 8, giving 128 × 8 = 1024.
† 13700 will not go into 12400, therefore the next figure of the root is 0 and the next group of figures is brought down.

The square root of 12750 is 112.92. (The last digit of the root is increased to 2 because the remainder is more than half of the last divisor.) The square root of 4693.49 is 68.509.

The example on the right shows how to find the new figure in the root by trial and error in Step 5 and how to deal with a zero in the root.

There will be one digit to the left of the decimal point in the root for every pair of figures to the left of the decimal point in the original number. Likewise there will be one digit to the right of the decimal point in the root for every pair of figures appearing in or added to the original number to the right of the decimal point. The last statement is particularly important in taking the square root of numbers less than one.
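The seven steps above can also be followed mechanically. The Python sketch below is an illustration, not part of the original text: the function name `sqrt_by_pairs` and its arguments are hypothetical. It takes the number as a string so that the pairs of Step 1 can be marked off from the decimal point, and it returns the root together with the final remainder and the last true divisor, the two quantities the text uses to decide whether the last figure should be increased.

```python
def sqrt_by_pairs(number: str, decimals: int = 2):
    """Digit-by-digit square root following Steps 1 to 7.

    Returns (root as text, final remainder, last true divisor).
    """
    whole, _, frac = number.partition(".")
    # Step 1: groups of two digits each way from the decimal point.
    if len(whole) % 2:
        whole = "0" + whole
    frac = (frac + "0" * (2 * decimals))[: 2 * decimals]
    digits = whole + frac
    pairs = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]

    root = 0
    remainder = 0
    divisor = 0
    for pair in pairs:
        dividend = remainder * 100 + pair   # Step 4: annex the next group
        trial = 20 * root                   # Step 5: double the root, annex a zero
        figure = min(9, dividend // trial) if trial else 9
        # Step 5, continued: correct a figure that is too high by trial and error.
        while (trial + figure) * figure > dividend:
            figure -= 1
        divisor = trial + figure            # Step 6: true divisor
        remainder = dividend - divisor * figure  # Steps 3 and 7: subtract the product
        root = root * 10 + figure

    text = str(root).rjust(len(pairs), "0")
    cut = len(whole) // 2                   # one root digit per pair left of the point
    integer_part = text[:cut].lstrip("0") or "0"
    root_text = integer_part + ("." + text[cut:] if decimals else "")
    return root_text, remainder, divisor


print(sqrt_by_pairs("12750", 2))     # ('112.91', 13319, 22581)
print(sqrt_by_pairs("4693.49", 3))   # ('68.509', 6919, 137009)
```

In the first call the remainder 13319 is more than half of the last divisor 22581, which is the condition the text gives for raising 112.91 to 112.92. In the second call the loop passes through the rejected trial figure 9 and the zero figure described in the notes to Figure 1.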
ACCURACY OF STATISTICAL DATA

The question of how many figures shall be retained in the result of a computation is particularly important in statistical work because many of the data employed are to some degree approximate. The problem of the statistician can be explained by contrasting his computations with those of the bookkeeper who is engaged in keeping a record of numerical facts in dollars and cents. The latter must keep his records accurate to two decimal places. Suppose the following inventory of raw materials was being prepared:

1,367 ft. 1 in. round iron at $5.25 per 100 ft.                 $71.77
11,000 ft. 2\ in. X \ in. strap iron at $7.625 per 100 ft.     $838.75
etc.

The first entry might be carried out to four decimal places, i.e., to $71.7675, but the last two places are of no value to the bookkeeper who is interested in accuracy only to the nearest cent. Similarly the second entry is carried to cents although the 11,000 ft. may be only an estimate and even the $838 not entirely accurate. The question of how many figures to retain does not arise in either of these cases nor does it arise in any case for the bookkeeper. The statistician is not in a similar position because more commonly he is dealing with data that are not expressed in dollars. Even when he deals with data expressed in dollars the question is not likely to be whether they should be carried to the nearest cent but whether to use a unit of $100, $1,000, or $1,000,000.

Statistics is often defined as the science of large numbers, and properly interpreted this definition is sound, but to many it merely marks the statistician as one who works with figures containing six or eight or even ten digits. Nothing could be farther from the facts than this impression. True, the statistician deals with aggregates of the magnitude of millions or billions, but a part of his working equipment consists in the use of well-established rules for rounding off large numbers.
Rounding Off Numbers

Meaning. Precision work in a machine shop is seldom more accurate than to one part in a thousand. If the statistician dealing with data concerning the business world can achieve the same degree of accuracy as the machinist, the results will be amply satisfactory. Let us examine the meaning of data accurate to one part in a thousand. The average weekly earnings of factory workers in December, 1936, according to the National Industrial Conference Board, was $26.63. This figure is an average obtained by dividing total weekly payrolls by number of workers employed and means that the result of the division was somewhere between $26.625 and $26.635. Hence a complete statement of the figure would be $26.63 ± .005. That is, a variation of as much as .005 may be present in 26.63, or a variation of 5 in 26,630, which is equivalent to 1 in 5,326. The average weekly earnings figure quoted to the nearest cent, therefore, is accurate to 1 part in 5,000 approximately.

From this example it will be clear that any figure quoted to four digits is accurate to at least 1 part in 2,000, on the assumption, of course, that the four quoted figures are accurate. Hence all the precision that is needed in statistical work can be provided by maintaining accuracy to four digits. In the preceding example this requirement was met by quoting weekly earnings to the nearest cent, but more generally four-digit accuracy will be sufficient regardless of the relation of the four digits to the position of the decimal point.

Significant Figures. In a single number or in the results of a computation the digits that show the extent to which the figure is accurate are called significant figures. Some examples will help in understanding this definition. The number 98,000,000 has two significant figures unless it is known from the surrounding circumstances that the zeros are an accurate representation.
If the actual amount represented may be anything between 97,500,000 and 98,500,000 then only the first two figures, 98, are significant and the zeros have no other purpose than to show the position of the decimal point. On the other hand if it is known that the actual amount lies somewhere between 97,999,500 and 98,000,500 then the original number would have five significant figures. That is, the first three zeros would be significant while the last three would serve the function of showing the position of the decimal point.

Unless some indication to the contrary is given, the final zeros of a whole number are not to be considered significant. Likewise in a number less than one, zeros immediately following the decimal point are not significant. For example, .00042 has two significant figures, but .000420 has three significant figures because the final zero should be taken to mean that the operation was carried to three digits and the third one was found to be zero. That is, the actual facts are somewhere between .0004195 and .0004205. But .00042 means that the actual facts are somewhere between .000415 and .000425.

The argument of the preceding section can be summarized in terms of significant figures as follows: regardless of the absolute size of any datum,⁷ not more than four significant figures need be retained for statistical purposes.

Method of Rounding Off. When data are expressed to more than four significant figures, or more generally whenever a reduction in the number of significant figures is desired, methods of rounding off must be followed. There is no universally used set of rules for rounding numbers, but a set which has wide acceptance may be stated as follows:

1. When more than five is eliminated the preceding digit should be increased by one.
2. When less than five is eliminated the preceding digit should not be changed.
3. When exactly five is eliminated the preceding digit should be increased by one if it is an odd number but should not be changed if it is an even number.

Examples:

GIVEN NUMBER     ROUNDED NUMBER
1267862          1268
8762180          8762
5863500          5864
5862500          5862
5862517          5863

Sometimes a number rounded to four significant figures is subsequently rounded to three significant figures. This can be done by applying the same rules, except for a number such as 467465. Rounded to four significant figures the number becomes 4675 and subsequently rounded to three significant figures according to Rule 3 it becomes 468, but obviously the result of rounding to three significant figures should be 467. This case is covered by an auxiliary rule sometimes followed by computers: "If a number when rounded upward ends in an even 5, indicate that fact by a prime (')." According to this rule the four-significant-figure result for the example would be written 4675' to indicate that in any subsequent rounding to three significant figures the third digit should not be increased to 8.

⁷ The word "data" should be used in a plural construction. The singular is "datum," referring to a single item or figure as used here.

The foregoing is a statement of the mechanics of rounding off numbers. However, the statistician's problem does not end here, because in many cases the data with which he must work will not be accurate to four significant figures and usually the degree of accuracy is not stated. In such cases formal rules must be supplemented by a knowledge of the background of particular data. We proceed, therefore, to a detailed description of the kinds of figures which appear in statistical work and the basis for judging their accuracy.

Counting and Measurement

In statistical work enumerations of two kinds appear: (1) those in which the units are counted, and (2) those in which the units are measured.
For example, the value of exports of 102 countries in 1935 as reported by the United States Department of Commerce was $11,580,000,000. The number of countries included in the report was an exact count and any computations based upon it would not be subject to error. The value of exports was obtained by totaling the reports of customs of the several countries after converting the different monetary units to dollars, using some agreed upon set of exchange ratios. Due to inaccuracies of reporting within individual countries, variations in methods of valuing exports, and the complication of applying exchange ratios between different monetary units, the figure for value of exports is at best only an approximate measurement.

Cases similar to both of these appear in statistical work. Units which are counted give rise to little or no difficulty in subsequent work. They may be accurate to five or six or more significant figures but not more than four need be retained in statistical work. On the other hand units which are measured immediately lead to the question: How accurate are the results? Sometimes this question is answered specifically. For example, the United States Department of Agriculture defines No. 1 Soft Red Winter Wheat as follows:

Minimum test weight per bushel: 60 lbs. (Scales accurate to one-tenth of a pound, hence the wheat must weigh more than 59.9 pounds to be No. 1 grade.)
Maximum limits of
    Damaged kernels            2%
    Foreign material           1%
    Wheat of other classes     5%
(All of these are taken to the nearest per cent when tests are made.)⁸

Another example which states the margin of error is presented in Table 4.
TABLE 4
LONG-TERM PRIVATE DEBT: ESTIMATED AMOUNTS OUTSTANDING FOR 1912, BY CLASSES*
(Amount of debt in billions and tenths of billions of dollars)

CLASS OF DEBT         ESTIMATED DEBT    PER CENT DISTRIBUTION    MARGIN OF ERROR (per cent)
Total                 31.3              100.0                    10
Railway               10.7              34.2
Public utility        5.3               16.8                     5
Industrial            4.5               14.4                     10
Farm mortgage         3.8               12.2                     4
Urban real estate     7.0               22.4                     15

NOTE. The margins of error shown in the table for private debt represent a non-statistical evaluation of the figures by the estimator.
* Statistical Abstract, 1936, United States Department of Commerce, Bureau of Foreign and Domestic Commerce, p. 273.

⁸ Handbook of Official Grain Standards of the United States, United States Department of Agriculture, Bureau of Agricultural Economics, revised June, 1937.

Examples of Measurement

More commonly the error to be expected is not indicated. Thus the user of the data is left to judge the degree of accuracy which can properly be attributed to them. Judgments of this sort must be based upon a knowledge of the method used in obtaining the data and a background of information concerning the source. For example, the United States Department of Agriculture announces in December the estimated crop of winter wheat for the year. The estimate as of December 1, 1937, was 873,993,000 bushels. The department receives annually about 160,000 reports from farmers in all sections of the country; these include estimated acreage of wheat planted and estimated yield per acre. The two estimates are multiplied together to give approximate production in each locality. These approximate production figures are then weighted according to the probable total production which each represents, and combined. The result is an estimate of the production of wheat in the entire country. The process is actually much more refined than this incomplete statement would imply.
The results obtained, while not entirely accurate, usually prove to be within 2 or 3 per cent of the actual production recorded in agricultural censuses.

An example of a different sort is the monthly report of floor space in new buildings contracted for, as compiled by the F. W. Dodge Corporation. The totals for thirty-seven states of the United States east of the Rocky Mountains are aggregates of the reports of local offices in all sections of this area, supplemented by reports from correspondents. The floor space in a building is estimated from the plans used in letting the contract for the building. These estimates are not intended to give the exact number of square feet of floor space; they may vary as much as 10 per cent from the actual area. When many such estimates are combined the underestimates tend to balance the overestimates so that the aggregate figure may be much more nearly correct than the individual estimates. However, in reporting non-residential construction contracts for January, 1938, as 9,637,000 square feet, an error of as much as 200,000 square feet might easily be present.

A third example is the report by the Bureau of Internal Revenue of the Treasury Department on net income of corporations for a year. The aggregate net income of all reporting corporations is carefully compiled from the bureau's files, hence the 1934 income of $596,048,000 is considerably more accurate than the figures of either of the preceding examples.

These examples indicate the extent to which a background of knowledge of methods of collection is necessary in understanding the accuracy of data. The figures in the three illustrations are not equally accurate. The crop estimate may easily be in error by as much as 10 million bushels, hence not more than two significant figures of the estimate are accurate and it might as well be stated as 870 million bushels.
There is false accuracy in stating this estimate to the nearest thousand bushels, because comparisons with the quinquennial census of agriculture show that the estimates usually differ by several million bushels. False accuracy is common in published data but it causes little difficulty so long as the background of the data is sufficiently familiar for users to be aware that more significant figures have been retained than is warranted.

The same argument applies to the figure for floor space of construction contracts. It might better be stated as 9 million square feet,⁹ since a variation of as much as 200,000 square feet is inherent in the method of collecting the data. On the other hand the figure for corporation income tax is accurate to six significant figures, but there is no need to retain more than four significant figures; hence the figure should be written as 596.0 million dollars. The zero following the decimal point is written just as a digit other than zero would be written to show data accurate to the nearest hundred thousand dollars.

Significant Figures in Computation

The emphasis up to this point has been on the number of significant digits to retain in a single figure or a list of figures pertaining to a single subject. We are now ready to develop methods of dealing with rounded numbers in performing computations. The rules applicable to each of the four fundamental operations will be explained in order.¹⁰

In Addition. Each of the examples in Table 5 illustrates a particular point in dealing with approximate numbers. In Example A exports from each division are given to the nearest hundred thousand dollars. This is done because the data are no more accurate than to that unit and because no greater accuracy is needed in statistical work. In the total there is no reason for retaining the fifth significant figure, and total value of exports for 1934 may be stated as 2,133 million dollars.
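The treatment of Example A, together with the rounding rules stated earlier, can be sketched in Python. This is an illustration under stated assumptions and not part of the original text: the function name `round_sig` is hypothetical, and the `decimal` module's `ROUND_HALF_EVEN` mode is used because it behaves like Rule 3 (an exact five is rounded to the even digit). The export figures are those of Example A in Table 5.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_sig(value, figures=4):
    """Round value to the given number of significant figures.

    An exact five is rounded to the even digit, as in Rule 3.
    """
    number = Decimal(str(value))
    if number == 0:
        return number
    # Position of the last digit to keep, counted as a power of ten.
    exponent = number.adjusted() - (figures - 1)
    return number.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_EVEN)


# Example A: exports by division, in millions of dollars.
exports = ["810.8", "207.3", "509.9", "48.0", "259.8", "297.5"]
total = sum(Decimal(x) for x in exports)
print(total)             # 2133.3
print(round_sig(total))  # 2133

# The rounding examples given under "Method of Rounding Off".
for given in (1267862, 8762180, 5863500, 5862500, 5862517):
    print(given, int(round_sig(given)))
```

The loop keeps the zeros that mark the position of the decimal point, so 5863500 appears as 5864000 where the text writes the four retained figures, 5864.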
In Example B the operating revenues have been carried to the nearest dollar. The data are perfectly accurate since they come from audited statements submitted to the Interstate Commerce Commission by the railroads. There is, however, no advantage in retaining ten significant figures in statistical work. According to the rule the operating revenue may be stated as 3,272 millions of dollars or the figure may be carried to dollars rounded off to the nearest million as shown in the table.

⁹ When the size of the unit in which data are expressed is increased the change should be from single units to thousands or millions or billions rather than to intermediate sized units. Thus 12,416,736 could be stated as 12,417 thousands if it were accurate to five digits, as 12.42 millions if it were accurate to four digits, and as 12.4 millions if it were accurate to three digits. The expression of the last two examples in the form 1,242 ten thousands and 124 hundred thousands, respectively, must be frowned upon because of the potential confusion in the minds of students as to the number of zeros to be added, if one wishes to return to the original unit.

¹⁰ These rules are not applicable to bookkeeping where accuracy must be maintained to the nearest cent regardless of the number of significant digits retained in a particular

TABLE 5
EXAMPLES OF ROUNDING OFF MEASUREMENTS IN ADDITION

A. VALUE OF UNITED STATES EXPORTS OF MERCHANDISE BY COAST AND BORDER DIVISIONS, 1934*

DIVISION             VALUE OF EXPORTS (in millions and tenths of millions)
North Atlantic       $  810.8
South Atlantic          207.3
Gulf Coast              509.9
Mexican Border           48.0
Pacific Coast           259.8
Northern Border         297.5
Total                $2,133.3
Rounded total        $2,133.

B. OPERATING REVENUES OF CLASS I RAILROADS OF THE UNITED STATES BY SOURCE, 1934*

SOURCE               REVENUE
Freight              $2,629,301,525
Passenger               345,889,550
Mail                     91,139,847
Express                  54,013,025
Other                   151,222,875
Total                $3,271,566,822
Rounded total        $3,272,000,000

C. AREA OF LAND IN THE UNITED STATES FOR WHICH TITLE REMAINED WITH THE GOVERNMENT ON JUNE 30, 1935*

USE                                                            NUMBER OF ACRES
National forests                                               138,710,942
National parks and monuments                                     8,724,737
Indian reservations (estimated net)                             57,518,590
Military, naval, experimental reservations, etc. (approximate)   1,000,000
Unappropriated, but withdrawn (approximate)                    197,261,754
Total                                                          403,216,023
Rounded total                                                  403,000,000

* World Almanac, 1936.

Example C differs from A and B in that part of the data are approximations. The area of the military, naval, and other reservations is estimated at 1,000,000 acres without any attempt to be more accurate. The areas of the Indian reservations and the unappropriated lands are likewise only approximate, yet the figures are given to an acre. There is false accuracy in these two figures, and an inconsistency in the table. The result should not be carried beyond the limit of the least accurate figure, which appears to be millions of acres. The total, therefore, should be stated to only three significant figures.

The conclusion to be drawn from these examples is that usually no more than four significant figures are to be retained in a sum, and when the data are not accurate to four digits fewer should be retained to avoid introducing false accuracy in the result. It is not to be implied that all cases which arise will conform to these three examples. On the other hand study of these examples will provide guidance in the selection of the proper number of significant figures to retain in any set of data.

In Subtraction. The rules for subtraction are the same as those for addition. For example, a method which is commonly used in measuring the number of automobiles withdrawn from service each year includes these steps.
Number of automobiles registered, 1933                                   23,843,591
Number of automobiles produced for the domestic market, 1934              2,442,389
Number which could have been registered, 1934                            26,286,180
Number actually registered, 1934                                         24,933,403
Number withdrawn from service, 1934                                       1,352,777

This method is not particularly accurate as a measure of cars "scrapped" because all second-hand cars taken in by dealers and not yet resold as well as cars which are temporarily unlicensed by their owners are included in the 1,352,777. At the moment, however, the rounding off of the figures is the point of interest. In spite of the fact that these figures imply an accurate count of automobiles, they are really subject to a substantial margin of error. The exact amount of error is unknown, but no inconvenience follows because there is no advantage in retaining more than four significant figures. The result would therefore be stated as 1,353,000 automobiles withdrawn from use during 1934. There is no certainty that these data are accurate even to the nearest thousand, but they would be assumed to be accurate to that extent unless definite information to the contrary were at hand.

In Multiplication. The rule of four significant figures holds for multiplication just as in the preceding operations, but an additional rule must also be observed. The product of two measurement numbers must not be retained to more significant figures than the least number of significant figures in either the multiplier or multiplicand. For example, during May, 1936, 2,648,330 long tons of pig iron were produced in the United States. At that time the Composite Pig Iron Price was $19.96 per long ton. The value of the month's production was 2,648,330 × $19.96 = $52,860,666.80. But the data for pig-iron production are approximate to an unknown extent and the composite price is an average which is accurate only to the nearest cent.
Assuming that the production figure is accurate to the nearest hundred tons would give five significant figures, but there are only four significant figures in the price. Therefore, the value of production should be stated to only four significant figures and could be written 52.86 million dollars. If the figure were written complete, it should be $52,860,000.

The reason for this approximation will be apparent from the two following computations which show the maximum and minimum values which this product may take when the last significant figure of each number is given its maximum and minimum value.

        A                  B
     MINIMUM            MAXIMUM
    2,648,250          2,648,350
  ×    19.955        ×    19.965
   52,845,829         52,874,308

These products differ in the fourth significant figure. It is therefore apparent that nothing beyond the fourth figure is of any value and indeed that the fourth figure is not exact, although sufficiently accurate for statistical work.¹¹

The rule for multiplying also includes the special case of squaring a measurement number. The significant digits of the square should not exceed the significant digits of the original number. For example, the square of the measurement 26.85 should be retained as 720.9.

In Division. The rule for division is: There should be no more significant figures in the quotient than the least number which appears in either the dividend or the divisor. The general rule of not more than four significant figures in a statistical calculation also applies. For example, the December 1, 1936, final estimate of the United States cotton crop for 1936 according to the Department of Agriculture was 12,399,000 bales. The estimate of acreage harvested was 30,028,000 acres. The average yield per acre is obtained by dividing the production by the acreage, i.e., 12,399,000 ÷ 30,028,000 = .41291 bales per acre.
The result may be carried to five digits according to the rule that the significant figures retained in the quotient should not exceed the number of significant figures in either dividend or divisor, whichever is smaller. However, statistical work usually requires keeping only four significant figures. Rounding off to four places, then, the production is .4129 bales per acre. Actually the result would be expressed in pounds by multiplying the average in bales (.4129) by 500. The average yield in pounds would be 206.5 per acre.

¹¹ George G. Chambers, in An Introduction to Statistical Analysis (F. S. Crofts and Co., New York, 1925), would not retain the fourth significant figure. His rule is "if the product of two single number approximations is expressed as a single number approximation the integer [significant figures] of the product is less than the integer [significant figures] of the least accurate factor," p. 27.

Suppose that the acreage harvested were not considered accurate to the nearest 1,000 acres, but were reported as 30.0 million acres accurate only to the nearest 100,000 acres. The average yield per acre would then be 12,399,000 ÷ 30,000,000 = .413 bales. The result can be carried to no more than three significant figures because only three are significant in the divisor. The reason for retaining only three figures will be apparent, if the divisions are made giving dividend and divisor their minimum and maximum values.

A. MINIMUM: 12,398,500 ÷ 30,050,000 = .4126
B. MAXIMUM: 12,399,500 ÷ 29,950,000 = .4140

The fourth figure has no significance whatever since the two quotients do not agree even in the third figure. As stated in the discussion of multiplication, the third figure is somewhat approximate but is accurate enough for use in statistical work.¹²

In Extracting Square Roots. The reverse of the rule for squaring numbers holds for square roots.
That is, as many significant figures may be retained in the root of a measurement number as there are in the number. Hence √327 = 18.1.

SUMMARY

The purpose of this chapter was stated as an attempt to set forth in elementary form a background of computation methods which would facilitate the work of subsequent chapters. The first part is devoted to a review of arithmetic processes while the second deals with the rounding off of figures for statistical purposes and the rules for determining the number of significant figures to be retained. The material presented is, of course, germane to all operations with numbers, but is particularly useful in statistical computation.

The primary task of the statistician is not, however, to make of himself a "figuring fool." The most important task is mastery of the techniques which will be developed in the chapters which follow. Ability to compute rapidly and accurately must be considered as the necessary background for, but not the main object of, statistical work.

¹² Perhaps attention should be directed again to the meaning of the expression "good enough for statistical work." The figure .413 bales per acre appearing in print should be taken to mean not less than .4125 and not more than .4135. The variation in either direction is .0005 on .413 or 5 on 4130, which is a variation of one part in 826, and this is accurate enough for statistical work.

PROBLEMS

Problems 1-10 are self-tests in which the student can check his own performance against the standard time listed. Do not write answers in the book in order that additional trials can be made if the first one fails to meet the time limit.

1. Addition (40 seconds)

     943    167   4956   2286   6269
     376    742   6237   7463   4728
     641    969    312   9498   8247
     879    378   8468   4537   3722

2. Addition (3 minutes)

     876   24.29   476,876    31.35   .4832   1371.10
     937   15.15   377,139    42.50   .1887   1229A8
     711   41.69   991,387     1.46   .0942    782.20
     492   39.63   398,872    23.59   .3948     59.35
     321   23.15   814,612     5.62   .0038   2892.36
     173   15.12   329,388     7.35   .1850    755.73
     288    4.28   376,441    66.75   .3763   3842.45
     317   16.14   114,473   103.43   .0382   4721.21
     222   34.99   787,224    35.78   .0976   2783.29
     384   55.29   716,326     7.11   .1956   1972.48

3. Subtraction (35 seconds)

     1090   8617   31.762   217.32   $27,218.45   $586.89
      585   7758    4.86    29.685    11,216.10    497.98

4. Multiplication (1 minute, 30 seconds)

     921   875   726   486   1269   8296
      23    19    68    35    137    864

5. Multiplication (2 minutes, 30 seconds)

     3.8125   34.4167   .2976   620.14
     8.875    21.72     1.093    7.963

6. Division (1 minute, 15 seconds)

     237) 50481     593) 28464     418) 240768

7. Division (carry to four significant figures) (3 minutes, 30 seconds)

     29.57) 128.43     .2448) 107.321     224.08) 3.11417

8. Multiplication by short-cut methods (1 minute, 15 seconds)

     793 × 25     65 × 65     2641 × 33*
     93 × 107     732 × 199   47 × 47

9. Multiplication by short-cut methods (2 minutes)

     2.183 × .875   892 × 908   48027 × 901
     81 × 81        115 × 115   176 × 176

10. Division by short-cut methods (1 minute)

     5418212 ÷ 25     83.47 ÷ .05     .4983 ÷ 12.5

11. Find the value of
a) (28 X 37) + (12 X 16) - (31 X 29) *) 3 + (9 X 36) - (22 X 11) + 486 + (138 ~ 6) 1417-12(16+9(21-8)} <0 (86 X 22) + 44 + (98 + 7) X 210 - 432 ~ (12 X 12) 1217[81 X {5952 +- (31 X 32) + 4} - 900]

12. Find the value of
*) i + i+i+yV *) i - y + 7 3f+10A-6ii

13. A can do a piece of work in 15 days, B can do the work in 18 days, and C in 25 days. What fraction of the work will the three working together perform in six days?

14. In Problem 13, after A has worked four days and C six days, what is the difference in the fraction of the work performed by the two?

15. 12^X1055 -f-649J-= ?

16. Mr. Smith invested $27,500 in a partnership having a total capital of $200,000.
He later sold J of his holding in the concern. What fraction of the ownership of the concern did he retain? What fraction did he sell? What amount should Mr. Smith receive of a $12,000 profit, (1) originally, (2) after selling 1 of his equity?

17. If 16 items are to be plotted at equal distances and centered on a sheet of graph paper 9} inches wide and the space allotted to each item must be a multiple of \ inch, how much space will be left for margins at each side of the paper?

18. If 14-j tons of coal cost $122-^ what was the cost per ton?

19. Express the following as (a) decimals, (b) per cents: ^ ff> -^ r \ T , -f^, 3g, 13 A-

20. Express the following as (a) common fractions, (b) per cents: .06, .003, .004167, .65, 3.1875.

10 per cent,

21. Express the following as (a) common fractions, (b) decimals: 6 per cent, 18f per cent, per cent, 262 per cent.

22. Arrange each of the following in ascending order of value:
(a) -43, TV, 37.5 per cent, .4, lo
(b) i per cent, .086, sir, iro per cent, ToVo

23. The spoilage on two crates of oranges each containing 210 oranges was 17 per cent and 33 per cent, respectively. Find the income from sale of the unspoiled oranges at 45 cents per dozen.

24. Assessments in a city are maintained at 70 per cent of market value. If Mr. White pays $302.40 tax when the tax rate is $30 per thousand of assessed valuation, what is the market value of his property?

25. The balance sheet of a concern showed the following:

     Cash                   $  7,500
     Accounts receivable      38,500
     Inventory                15,750
     Investments               9,050
     Plant                   120,000
     Equipment                69,200
     Total assets           $260,000

Each type of asset was what per cent of the total?

26. An article cost $450. At what price should it be sold (a) to make a profit of 55 per cent on cost, (b) to make a profit of 45 per cent on the selling price?

27. If a worker's wages are cut 25 per cent and subsequently increased 25 per cent, the most recent wage is what per cent of the original wage?

28. Given the following:

     MONTH     GROCERY SALES   NO. OF DAYS STORE WAS OPEN
     July      $28,412         24
     August     29,827         26

Find the per cent of change in average daily sales in August compared with July.

29. The following information is available concerning the manufacture of a particular article:

                               1939        1940
     No. of units produced     200,000     275,000
     Overhead costs            $50,000     $50,000
     Variable costs            $100,000    $120,000
     Sales income              $200,000    $275,000

a) The per cent of profit on selling price increased by what per cent in 1940?
b) The per cent of profit on cost increased by what per cent in 1940?
c) The overhead per unit was what per cent of the selling price per unit in 1939? in 1940?
d) The variable cost per unit was what per cent of the selling price per unit in 1939? in 1940?
e) What discount on selling price could the manufacturer have offered in 1940 and still have maintained the same rate of profit as in 1939?

30. Find the square root of each of the following by arithmetic process and check the result in a table of square roots or by logarithms.
a) 360046
b) 65.604
c) 9.62048
d) 12089.37

31. What is the degree of accuracy of each of the following measurement numbers? Express the answer as a common fraction and as so many per 1,000 or per 10,000, whichever is preferable:
(a) 67
(b) 18.2
(c) 4200
(d) 4208
(e) 508.0
(f) .0007

32. How many significant figures are there in each number of Problem 31?

33. a) Round each of the following numbers to four significant figures.
b) Round each of the following numbers to three significant figures.
(1) 787428
(2) 13004
(3) 27.998
(4) 4055.5
(5) 9989.47
(6) 695.451
(7) 164850
(8) 28.9950

34. Which of the following are counting numbers and which are measurement numbers?
a) The three plots which Mr. Jones purchased were, respectively, 40 ft. × 120 ft., 100 ft. × 150 ft., and 20 ft. × 250 ft.
The total area was, therefore, .57 acres.
b) The 65 persons who were on the payroll sometime during the year were the equivalent of 43 full-time workers and the total payroll was $62,712.85; hence the average annual wage per equivalent full-time worker was $1,458.

35. How many figures would you expect to be accurate in each of the following? Give reasons for your answer in each case. All of the examples were taken from the Statistical Abstract of the United States, 1938.
a) The population of the United States was enumerated in 1930 as 122,775,046 persons.
b) The population of the United States was estimated by the Bureau of Census in 1938 as 130,215,000 persons.
c) The Office of Education of the Department of Interior reports the enrollment in colleges, universities, and professional schools in 1936 as 1,062,760 students.
d) The total assets of all member banks of the Federal Reserve System on the December 31, 1937, call date were $46,785,512,000.
e) The Bureau of Foreign and Domestic Commerce of the Department of Commerce estimates from a sample collection that the total retail trade of the United States in 1937 amounted to $39,930,000,000.

In each of the following problems express the summary figures to the correct number of significant digits.

36. A

LIABILITIES OF FEDERAL INTERMEDIATE CREDIT BANKS, DECEMBER 31, 1937

     LIABILITY                                                  VALUE (thousands of dollars)
     Paid in capital and surplus, United States government      100,000
     Surplus, earned reserves and undivided profits*             12,561
     Debentures outstanding (unmatured)†                        174,950
     Total                                                      287,511

*Net amount after deducting impairment or deficit.
†Adjusted for debentures held by banks of issue and by other federal intermediate credit banks.
B

PRODUCTION, TRADE, AND SUPPLY AVAILABLE FOR CONSUMPTION OF RAW SUGAR, CONTINENTAL UNITED STATES, 1935

     ITEM                                   QUANTITY (short tons)
     Production (beet and cane only)        1,651,000
     Brought in from insular areas          2,686,969
     Imports as sugar                       2,372,066
     Exports as sugar                         103,349
     Exports in other forms                    13,220
     Available for consumption              6,593,466

37. (a) In the following table, how many significant figures should be retained in the total consumption? (b) Assuming the accuracy of a population of 129,257,000 in 1937, what is the per capita consumption of meats in the United States?

PRODUCTION, FOREIGN TRADE AND CONSUMPTION OF ALL MEATS IN THE UNITED STATES, 1937

     ITEM                                        AMOUNT (million pounds)
     Production
        Federally inspected                      10,273
        Uninspected (estimated)                   5,299
     Exports of United States production            164
     Imports for consumption                        263
     Net change in storage stocks, decrease         402
     Consumption                                 16,073

38. Find the value of a corn crop estimated at 4,000 bushels, if it was sold for 87 cents per bushel.

39. A motorist drove 3,532 miles from Boston to San Francisco, using 207^ gallons of gasoline. What was his average mileage per gallon?

REFERENCES

CHAMBERS, GEORGE G., An Introduction to Statistical Analysis. New York: F. S. Crofts and Co., 1925. Chapters I and II deal with measurement, approximation and significant digits.

EDGERTON, EDWARD I., and BARTHOLOMEW, WALLACE E., Business Mathematics. New York: The Ronald Press Co., 1923. Chapter XII explains short methods of computation and various checks of accuracy.

LANGER, CHARLES H., and GILL, T. B., Mathematics of Accounting and Finance. Chicago: Walton School of Commerce, 1930. The first five chapters contain a detailed statement of the fundamentals of arithmetic and algebra. Short cuts are presented in pp. 43-64.

MURPHY, PATRICK, Short Practical Rules for Commercial Calculations. Albany: Weed-Parsons Printing Co., 1910 (originally printed in 1886).
A highly stimulating presentation of short-cut methods for the student who cares to pursue the subject at length.

WALKER, HELEN M., Mathematics Essential for Elementary Statistics. New York: Henry Holt and Co., 1934. Chapters I-VI contain material similar to part of the text. Chapter II on significant figures is particularly pertinent.

CHAPTER III

STATISTICAL INVESTIGATION

THE CHARACTER OF STATISTICAL INVESTIGATION

THE EXTENT to which the work of the statistician underlies the conduct of business affairs was discussed in chapter I. Sometimes the contribution which he makes is relatively simple, being confined merely to presenting sales figures graphically. On the other hand his task may consist in a study of sales records and indexes of regional purchasing power for the purpose of determining sales territories and establishing sales quotas, or the study of a sample of output to determine whether it meets contract specifications. Whatever the complexity of a particular problem, the sequence of steps followed in its solution involves the application of statistical method.

Definition of Statistical Method

The statistical method is essentially the use of the principles of scientific investigation in the study of aggregates of numerical information. Just as the physicist must develop laboratory methods and techniques for examining the theories of sound, light, etc., so the statistician must have methods of appraising the theories of probability and sampling in terms of the observed phenomena (numerical data) of the business world. The problem of the statistician is complicated considerably by the fact that business operations cannot be subjected to the control that is possible in the physics laboratory. As a result the methods of statistical investigation are those research procedures developed to meet the peculiar requirements of the problems arising in the conduct of business affairs.
An example will demonstrate the difference between the controlled conditions of the physics laboratory and the uncontrolled conditions of the statistics "laboratory." The physicist wishing to read the height of a column of mercury in a manometer tube sets up his apparatus, provides for a constant temperature in his laboratory, selects a time at which barometric pressure is stable, and proceeds to take a large number of readings on the scale attached to his apparatus. The average of a large number of such readings will be the theoretically best value for the height of the column of mercury in the tube.

In contrast to this, suppose that a statistician wishes to determine the per cent of the employable workers of the United States who are unemployed as of a given date. He might also elect to take a large number of independent readings of the phenomenon under investigation and take the average of the results as the best value for the per cent of workers unemployed. But he encounters a whole mass of preliminary problems before any observations can be made and none of these can be "controlled." He must define "unemployed person," "employed person," "employable person," and no matter how carefully these definitions are phrased doubtful cases will arise. He must determine how to select samples of the population which will be representative, and even the most meticulous care will not produce a result comparable with the stability of the column of mercury in the physicist's laboratory. These and similar problems have forced the statistician to develop methods of investigation which are peculiar to the type of data with which he deals and the uncontrolled conditions under which he must use them.

The employment of statistical methods in the solution of business problems belongs almost exclusively to the twentieth century.
At an earlier date when business enterprises were small, management was able to comprehend its problems in detail by personal contact. The increased size of concerns in the present period has required more planning and greater regimentation of operations. At the same time management has found it impossible to maintain personal contact with its problems. The alternative is control through the interpretation of numerical information. This chain of circumstances has led to the introduction of statistical methods of investigation as a primary aid in the performance of the function of management.

The Use of Statistical Method

Masses of Data. The methods used by life insurance actuaries give no information concerning the time at which a particular insured person will die, but they give very accurate information concerning the number of persons who are likely to die in any year out of a large number of a given age alive at the beginning of that year. Life insurance premiums are based upon the regularity of death rates among large groups of persons, not on a guess as to how long an individual will survive.

Similarly, a study of department-store experience may show that bad debt losses on charge accounts amount on the average to about 1 per cent of charge sales. It does not follow that an individual store must have bad debt losses of 1 per cent. This result can be applied to particular cases only by taking account of the relation of conditions in the individual case to the average conditions found in the large group. The individual store may have 2 per cent bad debt losses in a certain year due to the fact that its customers have been experiencing the effects of a great amount of unemployment. In another year when its customers are fully employed the same store may have only a fraction of 1 per cent bad debt losses.

Another example is the use of income tax statistics in the determination of sales quotas.
Studies show that the higher the percentage of the population of a state filing income tax returns, the higher the percentage of the population purchasing automobiles. This relationship can be used to establish sales quotas for automobile agencies in the various states. It does not follow, of course, that those persons who file income tax reports will necessarily purchase automobiles, but that the higher purchasing power evidenced by the larger percentage filing tax reports will be available for the purchase of automobiles. Hence intensified sales effort where the purchasing power exists should produce the best sales results.

These three examples show how management uses the results obtained from the study of mass information. The typical situation found in the group is used as a guide for action within individual concerns.

Case Investigations. There is, however, one type of statistical work known as the case method which does not deal with masses of data. An individual case is studied intensively, usually over a period of time, in order to make a complete analysis of its operations. The case may be one individual, a single family, a business concern, or any other similar entity.

In statistical work case investigations are of less frequent occurrence than mass data investigations. More often than not case studies eschew statistical method entirely and rely solely on historical description in the presentation of results. A case study is characterized by the establishing of such a strong personal relation between the investigator and the person or persons furnishing information that a vast amount of detailed information can be obtained concerning the case. In presenting the results, each case is written up separately and represents a complete investigation in itself. The distinguishing feature of the case method is the fact that a detailed description of the individual case is the objective.
The records maintained by physicians concerning their patients become most complete life-histories, sometimes covering the entire span from the cradle to the grave. These records are, of course, confidential, but their anonymous publication would provide a remarkable background for the study of sickness and health problems. In a similar fashion a file for a period of years of a financial manual such as Moody's Manual of Investments, giving as it does a brief case history of many individual corporations, becomes a compendium of invaluable case records of the founding, growth, financial organization, and in some instances the decline of individual concerns. These are available for study either as individual cases or collectively as the raw material for statistical analysis.

Case study is used infrequently in the investigation of business problems. It is a method that is well adapted to studies of social phenomena and has been widely used in the field of social work. As such it lies outside the scope of this book.

THE CANONS OF STATISTICAL INVESTIGATION

The attitude of the statistician toward his work is a matter of considerable importance. His methods are equivalent in the field of social science to those employed in the exact sciences by the chemist, physicist, and biologist and his attitude toward his work must be equally scientific. Under no circumstances can he become an advocate or a special pleader. Statistical work done for purposes of pleading does not deserve the name of scientific research.

As a means of promoting the scientific character of statistical investigation there are certain standards or requirements which should be uniformly maintained. These fall naturally under three heads, each of which requires detailed explanation.

Definite Object

Statistical investigation is never aimless. It is always directed to the solution of a specific problem.
The problem may be as basic as finding the total annual income of the nation or as circumscribed as a study of the amount of flour hauled on the New York State Barge Canal during September, 1937. But regardless of scope the purpose must be specifically defined. Unless this requirement is met, direction will be lacking in the investigation, unnecessary work will be done and results of questionable value will be obtained. It is essential therefore to have the exact object of an investigation fully understood before any other work in connection with it is undertaken. At all subsequent stages of the investigation the purpose must be kept in mind as a guide in the planning and execution of the project.

Unbiased Attitude

The statistical investigator sets out to determine by investigation the facts concerning a given problem, but not to prove a certain thesis. There are times when it is very difficult to maintain an unbiased attitude. Some questions are of such controversial character that even the most detached investigator finds himself influenced. On the other hand in reading the report of an investigation one frequently has a feeling that the author has "leaned backward" to avoid bias. This is the proper attitude of a careful investigator when he finds himself placed at the center of a controversy.

Conscious or unconscious bias may appear in statistical work. Conscious bias can be dismissed quickly. A person who willingly distorts statistics for the purpose of proving a preconceived idea should not be called a statistician. He is a propagandist. It is necessary to be vigilant at all times to avoid using results containing bias. Conscious bias may appear in one or several of the following forms: (1) direct misstatement, (2) ambiguous statement, (3) the use of only favorable data, (4) concealed shifting of units of measurement, (5) deliberate selection of incorrect techniques, and (6) misleading forms of presentation.
Careful study is usually required to detect unconscious bias. Perhaps it would be safe to assume that all statistical interpretations contain some bias but that in most cases it is not present to a harmful degree. This is only another way of saying that the results of statistical work must be interpreted by human beings, each of whom can interpret only in terms of his own experience and his attitude toward the problem at hand. An excellent example of unconscious bias appears in the writings of certain statisticians and economists who during 1928 and 1929 interpreted the trends of the times to mean that permanent prosperity at the then existing levels had been attained. Subsequent events have shown that these men were so enamored of the favorable factors that they overlooked the growing stresses in our economic system. Their biased attitude is apparent now, but at the time their teachings had a wide acceptance.

Skepticism

The beginner in statistical work is likely to have the attitude that numerical facts can be accepted without question. A few adverse experiences will usually dispel this initial trustfulness. The attitude of faith should then be replaced by skepticism or in the extreme by cynicism, because it is far better to err in that direction than to develop enthusiasm with its attendant misinterpretation. Many of the fallacies which appear in statistical presentation arise from failure of those responsible for the results to maintain a critical attitude toward their work.

STEPS IN STATISTICAL INVESTIGATION

As stated at the beginning of the chapter, there is a logical sequence of steps to be followed in statistical investigation. An outline of these steps will give the reader a view of the process as a whole prior to studying the details.

I. Statement of the problem
II. Preliminary planning of the investigation
III. Collection of data
     A. Library sources
     B. Direct sources
IV. Analysis of data
     A. Editing of collected information
     B. Tabulation
     C. Ratios
     D. Graphs
     E. Measures of central tendency
     F. Measures of dispersion and skewness
     G. Index numbers
     H. Time series analysis and application
     I. Correlation and variance
     J. Tests of reliability of samples
V. Interpretation and application of the results of analysis
VI. Preparation of a report of the completed investigation

The remainder of the book is devoted to a detailed presentation of the work involved in following through the several steps. Although the emphasis given on subsequent pages to the different parts of this outline depends upon the difficulty and ramifications of the particular subject, it is to be hoped that the reader will not lose sight of the fact that with one necessary exception he is following the outline step by step. The interpretation of the results of each type of analysis is not a procedure that can be relegated to a separate section of the book. Therefore the discussion of illustrative examples has been woven into the text wherever it has seemed desirable.

THE SCOPE OF DIFFERENT INVESTIGATIONS

The six major steps presented here cover the complete sequence of things that must be done in conducting an investigation, no matter how limited or how broad its scope. The subheadings are partly alternative and partly sequential depending on the character of a particular problem. The amount of detailed planning required and the time consumed in executing the plan will, of course, vary with the size and importance of the investigation. The type of planning in turn is directly related to the question of whether the investigation is internal or external in character. An internal investigation is one which deals exclusively with conditions within a single business concern or agency. Those investigations that originate outside the management of any particular business concern are called external.
There is one great difference between internal and external investigations: the former as a rule present no serious problems of collecting data, whereas the latter are seldom free from such problems.

Internal Investigations

Statistical studies by a business concern of its own records are usually conducted to obtain information needed to assist management. The most common examples are found in the work of the cost accounting department. The data for determining unit costs are found in the accounting department and in the plant production records. The task of allocating the various factory and overhead costs to obtain an average cost per unit of product requires the use of statistical techniques. The cost accountant ordinarily does not employ all of the steps of investigation as outlined. His task is a rather circumscribed one, but it is necessary for him to be familiar with the complete process in order to make intelligent use of the part that he needs.

Internal investigations of a more general character are a necessary part of business control, and these are likely to make use of more of the steps of statistical investigation. For example, an oil company wishing to study the weekly rhythm in the sales of gasoline at its filling stations in different parts of a city would have no difficulty in collecting data, since that information would be included in the daily report of the manager of each station. Combining the sales reports for a number of weeks to obtain an average relationship would involve certain adjustments for weather conditions, for any irregularities at the station which might affect sales, for holidays and other circumstances of a similar nature. Once the pattern of the weekly rhythm at each station had been obtained, the next step would be to study these patterns in pairs and groups to discover in what parts of the city similar rhythms appeared.
This information could be used by the central office in assigning attendants so as to provide the maximum service to customers, in planning delivery schedules of tank trucks, as well as in planning the location of new stations.

In this example the emphasis is on analysis and interpretation, and that will be found commonly true of internal investigations. Although relatively simple statistical techniques are involved in this case, trustworthy conclusions depend upon following the steps of investigation faithfully. This leads to the general observation that a knowledge of the steps of statistical investigation and the relation of each to the whole is necessary to protect the statistician from error, even though his particular problem involves the use of only a part of the whole procedure.

External Investigations

Investigations conducted by manufacturers' and trade associations, advertising agencies, research bodies, universities, and government agencies are ordinarily more general in character than internal investigations. Correspondingly they call for the use of a wider range of statistical techniques. In particular, the preliminary planning and the collection of data demand much more attention in external investigations. For example, the plans for the 1940 census of population were under way as early as 1937 in the Census Bureau at Washington and in the field with co-operating agencies. The field work required only a few weeks early in 1940, but a large staff will be continuously engaged for the next decade in preparing and publishing the various tabulations and analyses of the collected information. The entire process, in reality a continuous statistical investigation of the population, falls within the framework of method outlined in the preceding section.

Sometimes the scope of an investigation is limited in the sense that not all of the successive steps are carried on by a single agency.
The task may be confined to collecting data. If so, the steps following collection can be ignored, but those preceding actual collection must be given proper attention. Again the particular task may be confined to interpretation and presentation of data collected by someone else, but the preceding steps must be thoroughly understood before any attempt is made to explain the meaning of the results.

SUMMARY

All of this discussion points to the same conclusion, namely, no matter how simple a particular piece of statistical work may be, its execution requires a knowledge of the steps in statistical investigation. Through this principle the various details of statistical method and technique are welded into a unified whole.

Succeeding chapters are arranged so that the steps of statistical investigation will appear in natural sequence. Within this sequence the methods of analysis progress from those using simple techniques to those which are more involved.

PROBLEMS

1. What are the differences between research in the natural sciences and statistical research?

2. Describe an example from your own experience of the use of mass data in statistical work.

3. State a definite subject for investigation in each of the following fields: (a) automobiles, (b) cost of living, (c) athletics, (d) profit.

4. What changes would you suggest in the conclusions reached in each of the following examples?
a) Hourly wage rates in industry have increased uninterruptedly for the past 20 years, and the cost of living is lower today than it was 20 years ago. Therefore the living standard is higher today than it has ever been in the past.
b) In the District of Columbia in a recent year one male automobile driver in every 1,370 was involved in a fatal accident and one female driver in every 9,090 was involved in a fatal accident. Therefore women are safer drivers than men.

c) During World War I the United States army lost 126,000 men killed in action, died of wounds, and died from disease or other cause out of 4,355,000 men mobilized, a death rate of 28.9 per 1,000. During the years 1917 and 1918 the death rate for the United States exclusive of the armed forces was 32.4 per 1,000. Therefore it was safer in the army than at home.

5. Explain the differences between internal and external investigations.

6. Which of the steps of statistical investigation were employed in preparing the reports appearing as Examples 1, 2, and 3 in chapter I, pages 7-9? Give references in the examples of specific statements which indicate the use of the several steps named in your answers.

REFERENCES

CROXTON, FREDERICK E., and COWDEN, DUDLEY J., Applied General Statistics. New York: Prentice-Hall, Inc., 1939. Chapter I emphasizes the parts of statistical investigation and the kinds of errors that appear in statistical work.

MILLS, FREDERICK C., "On Measurement in Economics," The Trend of Economics. New York: Alfred A. Knopf, 1924, pp. 37-72. An advanced statement of the place of statistical investigation in the realm of science.

SECRIST, HORACE, "Statistical Standards in Business Research," Quarterly Publications of the American Statistical Association, Vol. XVII, No. 129 (March, 1920), pp. 45-58. An article that helped to establish standards in a period when business research was less firmly established than at present.

SPAHR, WALTER E., and SWENSON, RINEHART J., Methods and Status of Scientific Research. New York: Harper and Bros., 1930. Every embryonic statistical investigator should be familiar with the point of view expressed in chapters I-VI.

CHAPTER IV

PRELIMINARY PLANNING OF INVESTIGATIONS

INTRODUCTION

Inexperienced collectors of data sometimes make the mistake of jumping directly into the task of collection without an adequate comprehension of the problem with which they are dealing.
This practice should never be followed no matter how simple and direct the problem may appear to be. There are always preliminary points which should receive attention prior to the actual collection of data. The four major steps which should be followed are: (1) define the problem; (2) study the problem; (3) plan the procedure; (4) prepare a statement of the program.

DEFINE THE PROBLEM

At the outset a crude statement will serve as a focus for the initial consideration of what is involved in the problem, but the crystallization of a few ideas will very quickly provide a mental setting and point to some of the limitations which should be established as a basis for more careful planning of the investigation. These preliminary ideas should be brought together in a more complete definition which will indicate the subject to be investigated, the exact object of the investigation, and the limitations upon its scope.

An example will demonstrate the difference between an incomplete and a complete statement of the subject for research. Suppose that the statistician were to receive the following problem: "The sales of our company declined last month. This decline was unexpected, since all parts of the organization appeared to be unusually busy. Investigate the matter." This statement does not define the problem for research. It could be taken to mean that an investigation was wanted of why the organization appeared to be unusually busy; but assuming that an investigation is wanted of why sales declined, the statistician needs more information before proceeding with the work. Questions such as the following must be settled: Are all products to be included or only major ones? If the latter, which products? Is the investigation to be confined to discovering the facts or shall it include data pertinent to discovering the cause of the decline? How much time is available for making the investigation? Have the affected departments agreed to co-operate?

With questions of this sort settled, the problem might be restated somewhat as follows: "Investigate the extent of the decline last month in the sales of the five major products which we manufacture and present as much collateral information as possible to aid in determining the cause of the decline. Your report should be available prior to the directors' meeting which will be held three weeks hence." The purpose of the investigation is clear and the limitations as to time and scope are definite. This example illustrates the type of definition required in an internal investigation to be carried out within a short period. Larger problems will require a correspondingly greater amount of definition.

STUDY THE PROBLEM

Read About the Problem

A knowledge of previous work that may have been done on a problem should be acquired as background before a new investigation is undertaken. The existence of earlier studies can be determined primarily through a search of library files. One may find that the problem has been investigated previously and that any further investigation should be built upon the existing work. Again flaws may be found in the previous work which make it completely or partially useless for the purposes in view. The chief value of studying such previous investigations may lie in discovering what not to do.

Library search may disclose the fact that no similar statistical investigation has been made previously. But books and magazine articles may be discovered which give factual information dealing with some phase of the subject or clues concerning methods of investigation. Library reading on the subject will aid the investigator in avoiding duplication of work already done; in avoiding the errors made in previous investigations; in discovering methods of approach and procedure; and in acquiring a broad perspective of his problem.

Finally there will be some cases in which no usable information of any sort can be gleaned from library search. When this happens the investigator must be prepared to proceed without such assistance. He must be able to supply from his own experience the background that otherwise would have come from library reading.

Think the Problem Through

At this stage the investigator should take some time for thoughtful consideration of his problem. There are major parts of his plan which should be settled. Certain parts may need additional emphasis and others should perhaps be discarded. New phases may enter as a result of the reading done. The knowledge which has been acquired by reading needs to be related to the particular problem at hand.

The investigator should be able to visualize his entire procedure. At first this should be confined to the main outline of the work, and following that the details should be considered. In constructing this mental image of the work it is unwise to assume that the preliminary planning can ignore details that are apparently simple, for they may contain difficulties. The success of the true investigator lies in his ability to foresee these concealed difficulties and make provision for them. The case of the student who wrote to a number of cement companies asking for production in tons and price per barrel of cement will illustrate the point. The companies had to change their figures, which were recorded in barrels, to tons to meet the student's questions and the student in turn had to change from tons back to barrels when he tabulated the data. A small amount of foresight would have avoided the difficulty.

A word of caution may aid in avoiding misinterpretation of the preceding paragraph. While the plan of procedure should be thought out very carefully, it is scarcely to be expected that no subsequent changes will be necessary.
Regardless of how efficient the investigator may be, it is unlikely that he can foresee and provide for every contingency which may arise. The plan should therefore be sufficiently flexible to permit necessary adjustments to conditions as they develop.

PLAN THE PROCEDURE

The amount of planning needed will be determined by the complexity of the problem and the size of the investigation. In some cases the points discussed in this section will take care of themselves, but more commonly decisions concerning them must be made prior to beginning the collection of data. Under either circumstance consideration must be given to the elements of the plan lest some essential be overlooked.

Library Sources and Direct Sources

The reading which has been done on a problem should indicate fairly well whether the needed data will be available in libraries or whether recourse must be had to direct sources. It will not serve merely to remember that some data on the subject were referred to in a book or magazine article. The data must be found and examined. Then several questions must be settled. Are the data in usable form? Do they include the desired time period? Do they cover the proper area, i.e., nation, state, locality, etc.? Are they expressed in the correct unit for the particular purpose? Are they reliable? By the time these questions have been investigated the general problem of whether or not library sources can be used will have been definitely settled. If library sources can be used, the investigator is ready to move on to the next part of his plan.¹

If on the contrary library sources should fail to provide any or all of the required data, the possibility of securing them directly must be canvassed. Using direct sources² means going to the business concerns, agencies, or individuals possessing the information to obtain at first hand data which do not exist anywhere in print.
In some cases the preliminary survey may disclose the fact that the desired data cannot be found in library sources and that they are equally unavailable from direct sources.

¹ The details concerning the collection of data from library sources are presented in chapters IX and X.

² The classification of data as library and direct, a distinction according to source, represents something of a departure from the usual classification found in textbooks. The customary division into primary and secondary data places the emphasis on the number of times that data have been recorded, i.e., primary data are those which are being recorded for the first time by the investigator who assembles them, whereas any subsequent recording of the data by other than the original investigator makes them secondary data. For example, the report of steel production found in the Annual Statistical Report of the American Iron and Steel Institute is primary data and the Report is a primary source, whereas the same figures published in the Survey of Current Business of the United States Department of Commerce become secondary data, and the Survey a secondary source. The names, primary and secondary, suggest that the former are more reliable than the latter. The point of view of this book is that reliability does not depend so much upon the number of times the data have been handled, as upon factors related to the canons of statistical investigation discussed in the preceding chapter, factors which are as likely to affect one kind of source as another. The distinction between library and direct sources places the emphasis on methods of procedure. One type of research is required to obtain data already available for general use, but quite a different type of work is required to obtain data directly from the originating source.

For example, recently the Minimum Wage Division of the New York State Department of Labor undertook to obtain data on hours worked daily and weekly by operators in beauty parlors. The nature of the work in this trade is such that hours actually worked vary greatly from stated schedules, but are unrecorded except in the very large establishments. As a result, the field workers found it very difficult to secure accurate information. When obstacles of this sort are discovered, the choice lies between abandoning at least those difficult features of the investigation, or continuing with the understanding that the results will have only conditional validity. On the other hand if the preliminary survey indicates that the data will be available from direct sources, the investigator is ready to enter into the detailed planning of the work of collecting them.

Sometimes a combination of library sources and direct sources can be used. For example, in comparing wage rates in a particular community with rates for similar employment in the entire state and the entire country, it might be feasible to obtain the data for the state and the nation from the reports of the United States Bureau of Labor Statistics, whereas the local data would have to be secured directly from business concerns in the community. In all such cases it is desirable to make as much use as possible of library sources.

If the data required for an investigation can be obtained from library sources, the procedures discussed in the next sections will not be needed. On the other hand if direct sources must be used, the investigator must be thoroughly familiar with the principles of sampling and with the practical technique of the collection process by the method of either sample or census.

Census and Sample

In some investigations it is desirable or even necessary to make a complete enumeration. This is known as the census method.
The census method is used in part of the statistical work of the federal government. The decennial population census is a complete enumeration, as are the Census of Manufactures, the Census of Business, the Census of Agriculture, and others. Other complete collections of data are by-products of the tax-collecting function of the government. Examples of these are the statistics of imports, corporate and individual incomes, cigarette consumption, and gasoline consumption.

In contrast to these cases of complete collection of data are the great majority of external investigations in which the census method is impossible. Instead of collecting all of the information concerning a given subject, these investigations depend upon obtaining a sample which will be representative of the whole. The methods of securing a representative sample will be discussed in detail in chapter V. We are interested here merely in pointing out that results representing a large population of items can be obtained by the use of a sample. In constructing a wholesale price index no attempt is made to include the price at which every wholesale transaction is made. The prices of only a few articles in important markets are used. The American Experience Mortality Table, giving the age at death of an initial 100,000 persons at age 10, was constructed from a large sample of insured lives. Crop reports of the Department of Agriculture are based upon information received from local reporters in all parts of the country who have in the aggregate a knowledge of the condition of no more than 3 to 5 per cent of the acreage planted.

Internal investigations should use complete enumeration when feasible because of the greater accuracy, but situations may arise in which the size of the investigation or the difficulty of securing data even within a single concern preclude the use of all of the data.
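
The point made above, that results representing a large population can be obtained from a sample, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not a procedure given in the text: the function names, the simulated invoice amounts, and the 5 per cent sample size are all hypothetical choices made for the example.

```python
import random
from statistics import mean


def simple_random_sample(items, n, seed=None):
    """Draw n items without replacement, each item equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(items, n)


def estimate_total(sample_values, population_size):
    """Scale the sample mean up to the size of the whole population."""
    return population_size * mean(sample_values)


if __name__ == "__main__":
    # Hypothetical population: 5,000 invoice amounts in dollars (simulated data).
    rng = random.Random(1941)
    population = [round(rng.uniform(1.0, 50.0), 2) for _ in range(5000)]

    # A 5 per cent sample stands in for the complete enumeration.
    sample = simple_random_sample(population, 250, seed=7)
    estimate = estimate_total(sample, len(population))

    print(f"Estimated total from sample: {estimate:,.2f}")
    print(f"Total from full enumeration: {sum(population):,.2f}")
```

The estimate will not agree exactly with the full enumeration, which is the trade-off the text describes: some accuracy is given up in exchange for handling only a small part of the data.
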
A good case of partial enumeration occurred when a large mail-order house desired to have a check at two weeks' intervals on its gross profit or difference between total income from sales and cost of goods. The income from sales could be obtained from the accounting department, but to get the cost of goods sold in any two weeks' period would have been impossible from the point of view of both time and expense. The concern therefore took 100 orders at random³ and computed the cost of the goods included in those orders. The results were applied to the 19,000 orders which the concern filled during a two weeks' period.⁴ The method used did not lead to an exact answer to the problem, but the results were good enough to allow a current check on selling prices. In any event the time involved in using complete data would have forced the company to abandon the idea.

³ A random sample results from taking individuals from a group by some system that is in no way dependent upon the characteristics of the items chosen, so that the presence in the sample of any particular item is left entirely to chance. In this case the equivalent of a random sample could be obtained by taking every 190th order that appeared on the sales record. A similar result could be achieved by taking the first 10 orders recorded each working day of the two weeks' period. Any methods similar to these would serve the purpose of providing a sample of 100 orders which would be representative of the 19,000 filled during the period.

⁴ Example taken from page 76 of M. A. Brumbaugh and R. Riegel, Study Problems in Business Statistics. New York: American Book Co., 1935.

A decision must always be made on the question of census or sample. Sometimes attendant circumstances will make the decision almost automatic; in other instances they may complicate it. A case in point is the unemployment survey made at the close of 1937 by the federal government.
The necessity of having quick results made it desirable to rely on a sample. On the other hand previous experience indicated that only a complete enumeration would be reliable. A compromise plan was used in which the reporting form was distributed by mail carriers to every family, and various channels of publicity were used to urge all unemployed persons to fill out the form and return it to the local post office. There was considerable doubt whether the reporting would be complete, so a check was made by house-to-house canvass in selected cities, villages, and rural areas. The check showed that the voluntary reporting was about 72 per cent complete, but that there was considerable variation in the completeness of registration from one locality to another. The sacrifice of correct method to secure a quick report lessened confidence in the result.

The Collection Method: Agents and Mail Questionnaires

If it has been determined that a complete census must be taken, there is practically no choice as to method. Only by the use of agents for personal interviews and follow-up visits can 100 per cent collection be guaranteed. Even with the aid of compulsion by the federal government, when data for the Census of Manufactures are collected by mail, it is necessary to send agents to secure delayed reports.

When sampling is deemed satisfactory as a method of investigation, there are alternative methods of approach. In a study of limited scope, one investigator may make the plan and collect all the data personally, but as a rule some other method of collection must be employed. Agents may be sent out to secure replies to a list or schedule of questions, or the personal element may be abandoned entirely in favor of the use of questionnaires sent through the mail.
The two methods are sometimes combined when mail questionnaires are sent out to all from whom information is desired and after a reasonable period of time has elapsed agents are sent to those who have not replied. A variation of this method is employed when agents collect data in thickly populated centers and mail questionnaires are sent to respondents in less accessible regions. The detailed methods of collecting information by using agents or the mail will be presented in chapter VI, but the decision as to which method to employ is an essential part of the preliminary plan and rests upon a number of considerations.

Importance of Personal Element. The function of the agent is to create a favorable attitude toward his mission, to explain doubtful points concerning the investigation, to encourage the informant to provide the desired information, and to record responses. These things cannot be done with a mail questionnaire. It loses immediately whatever value inheres in the personal contact between an agent and an informant. The form and tone of the mail questionnaire should be designed to supply as far as possible the missing personal element, but the fact remains that a mail questionnaire is an impersonal appeal for information and the investigator must expect it to be treated as such.

Area Covered. Investigation within a single city or local area can usually be done more thoroughly and in less time by agents. The agents can also be directly supervised and their completed schedules checked as they are turned in. All of these things add to the accuracy of the work. When a larger area is to be covered, direct supervision is impossible, a larger number of agents cannot be as well trained, and the value of direct contact with the informant is greatly diminished because of poorer agent technique.
In general when a large area is to be covered mail questionnaires should be used, whereas agents are superior for investigations confined to small areas. There are, of course, exceptions to this rule as the subsequent discussion will show.

Time Element. It is very difficult to confine an investigation using mail questionnaires to a fixed time period. The questions may require only five minutes to answer, and the need for immediate answers may be quite clear to the respondents, but actual experience shows that the replies will straggle back over a period of time. The usual procedure is to close the collection arbitrarily after a reasonable period has elapsed and enough replies have been received to permit analysis. The more involved the questionnaire, the greater the uncertainty as to when the replies will be received. In planning the successive steps of an investigation using mail questionnaires, it is never advisable to allot a certain period such as two weeks or one month for the replies to be returned. Unless some flexibility is introduced into the time plan, the subsequent steps of the work are likely to be disorganized by unexpected delay in receiving filled-in questionnaires.

An example from the writer's experience will illustrate what can happen. An association of worsted yarn manufacturers requested an investigation of their equipment and trends in the production of different kinds of yarn. A questionnaire was prepared and submitted to the supervising committee of the association for approval. After making several suggested changes the form was printed and distributed to the 96 members of the association during a meeting at which practically all members were present and agreed to co-operate by returning the information within 30 days. At the end of 30 days about 20 completed questionnaires had been received. Another month elapsed during which an additional 10 or 12 replies were received.
A "follow-up" letter was then sent to all delinquents. This brought another 20 replies within a month. During the next six months cajolery, personal visits, and personal favors brought the total of completed replies to 70. The work of analysis was completed just about one year after the investigation was initiated.

In contrast to the uncertainty encountered in this example, the use of agents permits the establishing of a definite time schedule. The agents can be allotted fixed amounts of work and their operations can be carefully supervised. If two months, for example, have been allowed in the plan for agents to collect the data, it can be definitely expected that at the end of the two months all reports will be turned in. Nothing as certain as this can be anticipated when mail questionnaires are used.

Percentage of Replies. Where agents are used an investigator can lay his plans to get a certain number of cases and enough agents can be put in the field to secure the desired number within a specified time. No equivalent certainty concerning the number of cases can be introduced when mail questionnaires are used. Usually a large part of those to whom questionnaires are sent will disregard them; hence a return of 10 to 20 per cent on an ordinary investigation is the likely response. However, there are in particular cases circumstances which may result in a much higher or much lower percentage of return.

Actual experience with questionnaire technique leads to certain general explanations of the small proportion of replies. These can be listed as follows:

a) Some individuals and certain classes of the population have an aversion to giving any information under any circumstances. Others intend to reply but fail, due simply to inertia.

b) The questionnaire method has been over-used to such an extent that busy men throw all questionnaires into the wastebasket.

c) The sponsorship of a well-known individual or agency may increase the percentage of replies, and the absence of any such identification may affect the percentage of replies adversely.

d) As a rule, the shorter the list of questions, the higher will be the percentage of replies.

e) Simple questions with "yes" or "no" answers will bring a better response than complicated questions.

f) When the respondents have a direct interest in the subject matter of the questionnaire, or when they will receive some personal or group benefit such as a premium, a free sample, or a copy of the results of the study, the percentage of replies will be above the average.

These are some of the factors that lead to the low percentage of replies received from a mail questionnaire and they must be taken into account in estimating how many questionnaires should be sent out in order to get a desired number of replies.

Cost. The question of cost is closely related to area covered. In a local investigation the agents can be assembled for training at nominal expense and their transportation costs while in the field are small. In a larger investigation centralized training means transportation for the agents to the training point and back to the field, while decentralized training involves transportation for the training staff or the establishing of a number of training staffs. All of this is not only expensive but extremely cumbersome.

In an investigation covering a large area, mail questionnaires are usually less expensive than schedules collected by agents. Suppose that 8,000 letters were sent out in an investigation and 1,500 replies were received. The cost except for preparation of the questionnaire would be about as follows:

8,000 envelopes at $4.50 per thousand ........................... $ 36.00
8,000 business reply envelopes at $6.50 per thousand ............   52.00
8,000 addressing, folding and insertion at $2.25 per hundred ....  180.00
8,000 stamps at $.03 each .......................................  240.00
1,500 stamps (business reply rate) at $.04 each .................   60.00
Total cost ...................................................... $568.00

Cost per reply ($568 ÷ 1,500) = $.38.

The estimated cost turns out to be 38 cents per questionnaire which is probably less than the cost of using agents.

On the other hand it does not always follow that agents are cheaper for a local investigation. If this same study were made within a single city, the expense of postage would be reduced, making the cost about 32 cents per questionnaire. If a corresponding estimate of the cost of using agents turned out to be more than 32 cents per schedule, then mail questionnaires would be cheaper even though the investigation were local in scope.

Amount and Complexity of Information. If the number of questions is small, answers can be obtained by mail. A long list of questions practically precludes the use of the mail questionnaire because too few replies are likely to be received. To get replies to a long list of questions requires the persuasion of personal contact between agent and informant. Also the information which can be obtained by mail must be relatively simple. Questions which require lengthy explanations or interpretations or information which is difficult for the respondent to give, particularly if long statements are necessary to answer questions, all tend to reduce the number of replies by mail. When an investigation involves asking questions of this sort agents should be used. Replies by mail will not be satisfactory either in number or in accuracy.

A practical example will illustrate the circumstances under which one method is more suitable than the other. A research bureau collects retail prices of food articles monthly from 25 grocery stores and monthly sales from 50 drugstores. All of these stores are located in one city, yet the bureau uses agents to collect food prices but mail questionnaires to collect drugstore sales.
The difference lies in the fact that any clerk can give the food prices or the agent himself can take them from the price tags. Also the food prices are available any time the agent appears at the store, whereas the sales figures for drugstores are not made up until the manager or owner has time to prepare them. An agent might have to make several visits for the data. Further, the grocer would not bother to write down the prices of the 42 articles which appear on the schedule, but the druggist does not object to transferring a single sales figure from his ledger to the bureau's collection sheet. These examples illustrate the kinds of facts which can be obtained by agents and by mail questionnaires.

Type of Information. Quite apart from the question of complexity there are certain types of information which can be obtained better by mail; others are more suitable for collection by agents. Mail questions must not offend. The same, of course, is true of the questions on a schedule in the hands of an agent. However, the agent can get personal information which cannot be obtained by mail. Skillful interviewing may procure confidential information on subjects which would be offensive in the absence of the personal element. In 1936 the United States Public Health Service collected data by the use of agents from thousands of families all over the country on subjects entirely beyond the reach of mail questionnaires. Here are some examples from that schedule:

1. What disabling illness occurred in the family during the past year?
2. Is there other handicapping disease or condition?
3. Has anyone in this home ever been examined for tuberculosis?
4. Has anyone in this home been to a health clinic or health center during the past year?
5. Is any member of the family crippled, deformed, or paralyzed?
6. What is the annual family income?

This schedule had 64 questions, most of them on a par with those given, which in each instance asked for details as to conditions, treatment, and physician in attendance. Information of this sort could not have been obtained by mail.

Bias. When questionnaires are mailed to a list of business concerns or persons, that list has been selected as a representative group from which to obtain the desired information. At that point, however, the investigator's control over the group ceases. Some will reply, others will not. Are those who reply representative of the entire group? Experience shows that when a request is made to business men many who are not able to make a favorable report on the information requested will not reply at all. This tendency introduces a definite bias into the results and greatly reduces the value of the questionnaire method. An equally disconcerting bias enters when questionnaires are sent to individuals. Those with more education or more experience are likely to reply, whereas whole segments of the population which one may wish to reach will disregard the questionnaire entirely. In general then, a bias is likely to appear in the replies to a questionnaire because the ones who reply are not representative of those to whom the questions were sent. Notice that this is quite apart from any tendency of respondents to give biased answers, a difficulty which the investigator faces whether the data are collected by agents or by the use of questionnaires.

Summary. In deciding whether to use agents or mail questionnaires in a particular case, all of these factors must be taken into account. Sometimes one will be determining, again the balancing of all of them will point to the preferable method. Occasionally there is an advantage in using a combination of the two methods.
A schedule of questions can be sent to the informant by mail with a request that it be given preliminary consideration pending the arrival of an agent at a later date. This method is effective in investigations requiring complex information or where it is necessary to assemble the information from various offices of a business concern. The work can be done prior to the arrival of the agent, but the agent can go over the schedule to be sure the questions have been interpreted correctly. A modification of this method is used by the Department of Commerce in taking the Census of Business.

PREPARE A STATEMENT OF THE PROGRAM

In the course of attending to the various details arising in the preceding steps there is a chance that some essentials may have been overlooked, or that points originally included in the plan will subsequently be forgotten. To prevent such contingencies the entire program should be put in writing. There are several advantages to the investigator in doing this. It forces him to regain proper perspective with respect to the investigation. It permits him to pick up any loose end in his plan. It gives him a complete statement to which reference can be made in the future, if puzzling situations arise. It provides a preliminary outline for writing the final report.

The statement of the program should be submitted to the sponsor [5] of the investigation for approval. This step is particularly necessary when the project involves direct collection of data but is applicable to some extent even though the data are to be taken from library sources. All too frequently misunderstandings between investigator and sponsor arise subsequently because of failure to come to an agreement at the beginning as to exactly what will be done and how it will be done. The investigator should present his program in writing and in return insist upon a written approval from his sponsor.

[5] "Sponsor" means the organization or individual authorizing the investigation.
Hence the sponsor may be a board of directors, a board of trustees, a higher executive, an advertising agency, etc.

PROBLEMS

1. Each of the following is a statement of a problem for investigation. Rewrite any of them that fail to define the problem completely.
   a) Retail sales taxes have no adverse effect on the sales of cigarettes.
   b) Between 1938 and 1941 the movements of prices on the New York Stock Exchange can be explained largely by charting with them the series of crises and tensions in European affairs.
   c) We (management) know that the change in the time of introducing new models of the Kistler automobile from January to November has changed the sales curve, but we are in doubt whether the expected decrease in the peak and trough of sales has occurred. Prepare a report on this question for the meeting of sales representatives on September 23.

2. A student was given the following assignment in a statistics class in 1935. "Has the center of the slaughtering and meat packing industry been moving westward during the past 40 years?" The student read The Jungle by Sinclair, a story of conditions in the industry in Chicago. He then proceeded to collect data on the number of head of cattle, sheep, and hogs shipped from each state of the United States, and the number of animals slaughtered at various important cities, such as Omaha, Kansas City, Chicago, and Buffalo. He also collected data on the livestock receipts at principal markets. The student then sought help in completing the work.
   a) What criticism would you make of his work to date?
   b) How would you advise him to proceed?

3. Which of the following are library and which are direct sources?
   a) The price of wheat is obtained from a daily paper.
   b) The sales of retail drugstores in a community are reported by the individual stores monthly to a research bureau which issues a monthly report of combined sales to the reporting stores and to newspapers.
   c) An advertising agency calls residences by telephone to inquire whether the radio in the home is in use.
   d) The federal income tax law requires that a copy of all tax returns be kept on file for public inspection in local offices of the Bureau of Internal Revenue. A student prepares a study of income distribution in his city based on these duplicate tax reports.

4. Investigate each of the following in the reference given to determine whether the method of collection is by sample or census.
   a) Each year, beginning in 1935, the Department of Agriculture publishes complete information on agricultural production. Agricultural Statistics, United States Department of Agriculture, pp. 1-5 (approximately).
   b) The net profits of corporations as compiled by the Federal Reserve Bank of New York. Survey of Current Business, 1938 Supplement, United States Department of Commerce, pp. 64 and 180.
   c) The value of production of manufactures in the United States. Biennial Census of Manufactures, any issue, United States Department of Commerce. (The description of method is found at different places in different issues. In the 1925 Census, for example, the description of method is found on pp. 3-6.)
   d) The loans and investments of reporting member banks of the Federal Reserve System in 101 cities. Survey of Current Business, 1938 Supplement, United States Department of Commerce, pp. 55 and 178.

5. State in each of the following examples of collection whether agents or mail questionnaires should be used and whether the census or sample method should be used. Give reasons for answers in each case.
   a) A city welfare organization wished to make an investigation of the extent to which families receiving city relief were paying money on installment purchases.
   b) A city restaurant association wished to study the distribution of expenses of doing business of its 53 members.
   c) An advertising agency wished to inquire from the owners of a certain make of automobile whether they would purchase the same make of car again.
   d) A corporation wanted information concerning how many of its 4,500 employees were home owners, the value of their homes, and where the homes were located.

6. In each of the following examples the student is expected to lay out a preliminary plan for the collection of data, giving explanations of procedure and reasons for choice where alternate methods are available.
   a) A study of vacant dwellings in the community in which your college or university is located. The purpose of the study will presumably be to determine: (1) the percentage of dwellings vacant, (2) what types of dwellings have the highest and lowest vacancy ratios, (3) the sections of the community having the highest and lowest vacancy ratios, (4) the relation of vacancy to age of dwellings, (5) allied questions that you may care to include.
   b) An automobile manufacturer advertising in newspapers and magazines, on billboards, and by radio wishes to discover which type of advertising is most effective in drawing the attention of the public to his product.
   c) A manufacturer of a well-known brand of toilet soap wishes to discover by a direct appeal to consumers why the sales of his product have declined during the past year.
   d) A state milk control board wishes to find the variations in the price at which whole milk is sold in retail stores in the state.

REFERENCES

BOWLEY, ARTHUR L., Elements of Statistics. London: P. S. King and Son, Ltd., 1920 (fourth edition). Pages 14 and 15 contain a brief but effective statement of the preliminary planning of statistical investigations.

BROWN, LYNDON O., Market Research and Analysis. New York: The Ronald Press Company, 1937. Chapter 8 gives a brief statement of the fundamentals of planning an investigation.

CHAPIN, F. STUART, Field Work and Social Research.
New York: The Century Company, 1920. Chapters I, II, and III are devoted entirely to the preliminary planning of investigations.

EIGELBERNER, J., The Investigation of Business Problems. Chicago and New York: A. W. Shaw Co., 1926. Chapters I-VI provide general background for the principles of collecting data.

SANDERS, ALTA G., and ANDERSON, CHESTER R., Business Reports. New York: McGraw-Hill Book Co., Inc., 1929. Chapters VI and VII give a detailed statement of the preliminary work involved in the collection process.

SCHLUTER, WILLIAM C., How To Do Research Work. New York: Prentice-Hall Inc., 1929. Chapters I-X contain a painstaking description of the preliminary steps of investigation.

SPAHR, WALTER E., and SWENSON, RINEHART J., Methods and Status of Scientific Research. New York: Harper and Bros., 1930. Chapter X contains a summary of methods of collecting data.

CHAPTER V

SAMPLING

RELATION TO KNOWLEDGE

Much of the world's knowledge is based upon inferences drawn from observation of samples. Finding the skeleton of a giant mammal embedded in rock strata demonstrated to have been on the surface of the earth 100,000 years ago, the paleontologist deduces the fact that such an animal lived in that period and then generalizes that this animal was typical of many alive at the time. One example has been found; therefore many others like it must have existed. A lumberjack taps along the side of a fallen tree with his axe and, listening to the sound, determines how far the tree is hollow and just where it becomes solid to the heart. His past experience in tapping logs represents a large sample providing the knowledge to be applied to the new log and his judgment is usually correct. A public speaker, wishing to drive home a point to his audience, illustrates with a story or an experience because he has found that this method of emphasis is the most effective.
The response he has obtained in the past represents a sample the results of which are a part of his platform technique.

These illustrations exemplify the extent to which sample experience becomes the guide to current action. Similar examples could be cited in every field of knowledge. The notion of sampling and the generalization of the results of sampling are in no sense peculiar to statistical work. Sampling is, however, particularly important in the field of statistics, because the numerical character of the subject lends itself to exact development.

THE IMPORTANCE OF SAMPLING

Sampling techniques are seldom necessary in internal statistical work but have their greatest application in external work. In the latter case it is seldom possible to obtain all of the data pertinent to a given statistical universe, [1] hence the usual situation requires that results be obtained from the study of samples. Thus we have: indexes of commodity prices based on a few hundred of the thousands of commodities that are traded daily; average hourly wage rates in manufacturing plants determined from samples including no more than 5 per cent of factory workers; the market for a product estimated from the results of sending a questionnaire to 1 or 2 per cent of the potential users. These examples are sufficient to indicate the importance of sampling in statistical work.

[1] The complete category of data from which a sample is drawn is known as a statistical universe or statistical population. In the preceding chapter an example was presented in which prices of groceries were collected monthly from 25 grocery stores in a city. The 25 stores are a sample of the universe or population consisting of all grocery stores in the city.

When a retailer buys shoes from a salesman he expects that they will be just like the sample which the salesman shows.
In the same way we might expect to estimate the average age of 2,000 freshmen in a university from the ages of a sample of 100 of them attending a freshman lecture. Certain differences between these two "samples" will immediately occur to the reader. The shoes are all made on the same machinery, defects are weeded out by inspection, and uniformity is assured at every step of the manufacturing process. Therefore, one shoe picked at random does represent the entire lot. On the other hand we cannot be sure that the sample of 100 freshmen is representative as to age. The lecture may have attracted only more mature students, or a brilliant younger group who completed high school in three years.

The absence of control over statistical data is precisely what makes it necessary to develop principles and methods of sampling. We wish to know something about a certain universe of events or facts, but are unable to make a complete enumeration. Instead we must record specific facts concerning a sample drawn from the universe, a sample which shall be representative of the universe. The problem is, How can such a sample be obtained?

THE PRINCIPLE OF STATISTICAL REGULARITY

If our knowledge of the universe is limited how can we ever know that a sample drawn from it is representative? The answer to this question comes from a principle which is as broad in its application as the laws of nature. It is known as the Principle of Statistical Regularity and may be stated thus: A sample selected at random from a universe will exhibit the characteristics of the universe, even though the number in the sample is small compared with the universe. The simplest illustrations of the operation of the principle occur in coin tossing and dice rolling. Every throw is exactly like every other one and the experimental material, i.e., coins or dice, remains constant. The result of an experiment with coins is presented in Table 6.
Ten coins were used and the results in the first four lines of the table are for groups of 50 throws each, or 500 coins. The 240 heads and 260 tails obtained in the first trial of 500 varied 4 per cent from the expectation of 250 each. The second trial gave 245 heads and 255 tails, a cumulative result of 485 heads and 515 tails in the first 1,000 coins. The cumulative result varies 3 per cent from the expected 500 of each. In successive rows of the table the results for the third and fourth trial of 50 throws, the third and fourth hundred throws, and so on are shown.

TABLE 6

THE PRINCIPLE OF STATISTICAL REGULARITY ILLUSTRATED BY COIN THROWING

| Number of throws of ten coins each | Result: heads | Result: tails | Cumulative actual result: heads | Cumulative actual result: tails | Cumulative expected result (equal number of heads and tails) | Per cent variation from expected result |
|---|---|---|---|---|---|---|
| 1st 50 | 240 | 260 | 240 | 260 | 250 | 4.00 |
| 2d 50 | 245 | 255 | 485 | 515 | 500 | 3.00 |
| 3d 50 | 253 | 247 | 738 | 762 | 750 | 1.60 |
| 4th 50 | 246 | 254 | 984 | 1,016 | 1,000 | 1.60 |
| 3d 100 | 501 | 499 | 1,485 | 1,515 | 1,500 | 1.00 |
| 4th 100 | 539 | 461 | 2,024 | 1,976 | 2,000 | 1.20 |
| 1st 400 | 2,024 | 1,976 | 2,024 | 1,976 | 2,000 | 1.20 |
| 2d 400 | 1,925 | 2,075 | 3,949 | 4,051 | 4,000 | 1.28 |
| 3d 400 | 1,999 | 2,001 | 5,948 | 6,052 | 6,000 | .87 |
| 4th 400 | 2,036 | 1,964 | 7,984 | 8,016 | 8,000 | .20 |
| 5th 400 | 2,009 | 1,991 | 9,993 | 10,007 | 10,000 | .07 |
| 6th 400 | 2,007 | 1,993 | 12,000 | 12,000 | 12,000 | .00 |
| 7th 400 | 2,015 | 1,985 | 14,015 | 13,985 | 14,000 | .11 |
| 8th 400 | 2,000 | 2,000 | 16,015 | 15,985 | 16,000 | .09 |
| 9th 400 | 1,993 | 2,007 | 18,008 | 17,992 | 18,000 | .04 |
| 10th 400 | 1,959 | 2,041 | 19,967 | 20,033 | 20,000 | .17 |
| 11th 400 | 2,013 | 1,987 | 21,980 | 22,020 | 22,000 | .09 |

The last column of the table shows how the percentage variation from the expected number tends to decrease as the size of the cumulative sample increases. At the end of the first 400 throws (4,000 coins) the variation is 1.20 per cent. At the end of the second 400 throws the variation increases slightly to 1.28 per cent and then decreases regularly through the third, fourth, and fifth groups of 400 throws and reaches zero at the end of 2,400 throws.
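The last column of Table 6 can be recomputed from the cumulative columns as a check on the arithmetic. The following is a minimal sketch in Python; the function name and the list layout are illustrative assumptions and are not part of the original text, while the figures are the cumulative heads of Table 6 and the corresponding numbers of coins tossed.

```python
# Cumulative number of coins tossed and cumulative heads, taken from Table 6.
# The "4th 100" and "1st 400" rows describe the same point, so it appears once.
CUMULATIVE_COINS = [500, 1000, 1500, 2000, 3000, 4000, 8000, 12000, 16000,
                    20000, 24000, 28000, 32000, 36000, 40000, 44000]
CUMULATIVE_HEADS = [240, 485, 738, 984, 1485, 2024, 3949, 5948, 7984,
                    9993, 12000, 14015, 16015, 18008, 19967, 21980]


def percent_variation(heads, coins):
    """Per cent by which the heads count departs from the expected half."""
    expected = coins / 2
    return abs(heads - expected) / expected * 100


variations = [percent_variation(h, c)
              for h, c in zip(CUMULATIVE_HEADS, CUMULATIVE_COINS)]

for coins, pct in zip(CUMULATIVE_COINS, variations):
    print(f"{coins:>6} coins: {pct:.2f} per cent")
```

Because heads and tails together always equal the number of coins tossed, the variation of the tails from expectation is the same figure, which is why the table needs only one percentage column.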
The exact result obtained at this point is purely accidental, as is the exact way in which the percentage variation declined with the increase in size of the sample. The important point is that the percentage variation becomes smaller and smaller as the size of the sample increases and that in spite of slight deviations it remains small through the seventh, eighth, ninth, tenth, and eleventh trials of 400 throws each. [2]

Examination of the result columns shows that sometimes the number of heads is greater than the expected number and at other points the number of tails is greater than expected; there is no indication of any fixed bias nor any tendency for either heads or tails always to exceed expectancy. Specifically, for instance, after 400 throws there were more heads than expected, but after 800 throws there were more tails than expected. This difference in the direction of variation together with the regular reduction in the percentage variation indicates the tendency toward regularity of the results as the size of the sample increases. The tossing of 44,000 coins has demonstrated the principle. If the tossing were continued, the percentage variations could be expected to diminish.

A further demonstration of the operation of the principle of statistical regularity is presented in Table 7, showing the results of throwing five dice. The first line of the table gives the results of the first 20 throws of five dice (100 faces showing or, as recorded in the first column, 100 dice). The maximum variation from the expected number is the appearance of 22 fives. This variation of 32 per cent is due to the small size of the sample. The decline in the variation of the actual from the expected result can be seen as the size of the sample is increased. There are two places at which the progressive decline in variability is broken: when the cumulative sample consists of 300 dice and when it consists of 3,600 dice.
These two exceptions to the operation of the principle of statistical regularity do not disprove its universality. The experiment was carried on with ordinary commercial dice and they were thrown within a confined space rather than being permitted to come to rest without obstruction. Either circumstance might be sufficient to explain the two irregularities that appear.

In both examples the expected occurrence of the recorded events is known. That is, coins should fall heads and tails with equal frequency; one face of a die is as likely to turn up as another. Consequently these are controlled experiments carried out to show how the principle of statistical regularity operates.

[2] The reliability of a sample increases proportionally to the square root of the number of cases in the sample. Thus to double the reliability, i.e., to halve the variability, the number of cases must be four times as great. The reason for this relation will be apparent from the form of the formulas for standard error in chapter XXIX.

TABLE 7. Results of throwing five dice. (The body of the table is illegible in this copy.)

Consider another example. A teacher made a practice each year of having each of his students measure the width of his desk.
The same ruler was used year after year, but the students' results varied individually and from year to year. The ruler was subdivided by 32ds of an inch and the students were told to read to 64ths of an inch. The yearly average for eight years is shown in Table 8.

TABLE 8

WIDTH OF A TEACHER'S DESK ACCORDING TO MEASUREMENTS BY EIGHT DIFFERENT GROUPS OF STUDENTS

| Year | Number of students | Average measured width of desk in inches |
|---|---|---|
| 1st | 32 | 48.6[?] |
| 2d | 26 | 48.49 |
| 3d | 30 | 48.61 |
| 4th | 31 | 48.60 |
| 5th | 28 | 48.37 |
| 6th | 36 | 48.39 |
| 7th | 31 | 48.62 |
| 8th | 28 | 48.60 |

The exact width of the teacher's desk is unknown, yet these averages perform in the same way as the observations of coin and dice throwing. As the number of heads sometimes exceeded the number of tails and vice versa, so in the same way some of these averages were slightly above the theoretically true width, [3] others slightly below. The example shows that results from samples tend to group themselves about an unknown true value just the same as they group themselves about a known true value. The coins and dice are samples in which the observations involved only counting. The desk data involved measurement. We conclude therefore that the principle of statistical regularity applies to both counted and measured samples.

[3] The theoretically best estimate for an unknown value is the arithmetic average of a large number of independent observations of that value.

A major question arises at this point. Will the same regularity appear when the universe is less homogeneous [4] than coins, dice, or the width of a desk? The answer can be obtained from another example.

[4] "Homogeneous" as used in statistical work means sufficiently alike to be used for the immediate purpose as though equivalent. For example, coins and dice are truly homogeneous in the sense that each one is identical with every other one, but human beings, animals, and other materials dealt with in statistical work are treated as though they were homogeneous even though appreciable differences of size, weight, and other characteristics appear within the groups. "Non-homogeneous," or "heterogeneous," means possessing characteristics which are sufficiently different to require classification in different categories.

The problem was to find the average number of letters in the last names of persons having telephones in Buffalo, New York. Each page of the telephone book contained four columns. A ruler was laid across each page of the book near the middle of the page and the number of letters in the name appearing above the ruler in each column was counted. Four samples were taken, the first containing a name from the first column of each page of the book, the second a name from the second column and so on. There were 265 pages in the book, hence each sample contained 265 items. The average numbers of letters in the names in the samples were:

| Sample | Average number of letters |
|---|---|
| 1st sample | 6.51 |
| 2d sample | 6.52 |
| 3d sample | 6.51 |
| 4th sample | 6.54 |

The similarity of the four results shows how the principle of statistical regularity operates. These samples were chosen entirely at random, [5] yet any one of them alone presumably would have represented the universe. Each sample contained only 265 out of a total of about 70,000 names in the telephone book, yet there is little doubt that the average number of letters in the last names appearing in the book is about 6.5. Note that we do not expect the sample to give the exact characteristics of the universe but rather an approximate indication of those characteristics. The four samples vary slightly and probably each one varies somewhat from the true value.
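The telephone book procedure can be imitated on a synthetic directory. The sketch below is only an illustration under stated assumptions: the distribution of name lengths is invented, since the Buffalo directory itself is not reproduced here, and the figure of 66 names per column is chosen merely so that 265 pages of four columns hold roughly 70,000 names. It takes one name from the middle of each column on every page, giving four samples of 265, and compares their averages with the average of the whole synthetic directory.

```python
import random
from statistics import mean

# Assumed layout: 265 pages x 4 columns x 66 names = 69,960 names in all.
PAGES, COLUMNS, NAMES_PER_COLUMN = 265, 4, 66

# Hypothetical distribution of last-name lengths; not taken from any directory.
LENGTHS = list(range(3, 13))
WEIGHTS = [2, 6, 12, 18, 20, 16, 12, 8, 4, 2]


def build_directory(seed=1941):
    """Return directory[page][column] as a list of name lengths."""
    rng = random.Random(seed)
    return [[rng.choices(LENGTHS, weights=WEIGHTS, k=NAMES_PER_COLUMN)
             for _ in range(COLUMNS)]
            for _ in range(PAGES)]


def column_samples(directory):
    """One name per page from the middle of each column: four samples of 265."""
    middle = NAMES_PER_COLUMN // 2
    return [[page[col][middle] for page in directory]
            for col in range(COLUMNS)]


directory = build_directory()
samples = column_samples(directory)
universe_mean = mean(length
                     for page in directory
                     for column in page
                     for length in column)
sample_means = [mean(sample) for sample in samples]

print(f"universe average: {universe_mean:.2f} letters")
for number, value in enumerate(sample_means, start=1):
    print(f"sample {number}: {value:.2f} letters")
```

Unlike the real investigation, the synthetic directory lets the universe average be computed outright, so the closeness of each sample to the true value can be observed rather than inferred.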
Such variations will always appear in samples. In fact the methods of analyzing data which will be developed in subsequent chapters include the measurement of the expected variation of the characteristics of a universe from those characteristics found in a sample drawn from it.

In the previous examples chance has operated in each case. The chances are equal of getting either heads or tails in tossing a coin; the chance that any one face of a die will turn up is one-sixth. In measuring the width of a desk overestimates and underestimates are equally likely. In the telephone book there is just as much chance that "Fry" will be printed near the middle of the page, as "Frendenberger."

[5] The concept of random selection of cases for a sample is explained in chapter IV, p. 58, footnote 3.

The cases which arise in practical business affairs are usually not so simple as these examples. The operation of pure chance which is so evident in the examples will be for the most part lacking in practical work. The investigator is forced to deal with conditions as they exist. In general more variables will be present and as a result adjustments become necessary. The real problem of sampling is to find methods of selecting the cases for the sample so that the characteristic to be measured or counted has a chance of occurring in the sample in the same proportion as it occurs in the universe. The amount of care required to do this will be evident when it is remembered that the extent of occurrence of the characteristic being studied is unknown in the universe and can only be inferred as the final step in the analysis of the sample. This would be circular reasoning were it not for the principle of statistical regularity.
Some of the characteristics of the universe may already be known, and if these known conditions of the universe can be reproduced on a small scale in the sample, then the operation of the principle is all that is needed to allow us to infer from its occurrence in the sample the extent to which a given unknown characteristic is present in the universe.

THE TWO PROBLEMS OF SAMPLING

There are two major factors to be considered in obtaining a sample: (1) how many cases must be included to obtain reliable results and (2) what cases must be included to secure representativeness.

The Size of a Sample

The first problem is, How many or what proportion of the cases in the universe must be taken for the principle of statistical regularity to operate? There is no numerical answer to this question. It would be wrong to say that a 50 per cent sample or a 10 per cent sample will be satisfactory. In fact such an answer is meaningless in coin tossing where the universe is infinite. Even when the universe is limited, as in the telephone book example, we do not attempt to say that a certain number of cases or a certain percentage of the total number of cases in the universe will be a large number. The telephone book used in this test experiment contained about 70,000 names. A sample of 265 was therefore only about four-tenths of 1 per cent, yet the results obtained from the four independent samples fell within a very narrow range.

The question of how many cases to include in a sample must be decided for each problem separately. The number depends primarily on the degree of reliability required and the diversity of the characteristics present in the universe. The tests for reliability are developed in detail in chapter XXIX. The question of diversity of characteristics can be discussed at this point.
If the universe is as strictly homogeneous as the letters in names in a telephone book, a very small sample will suffice for the purpose of determining the average number of letters per name. On the other hand, if a sample from the telephone book were used to determine the percentage of subscribers who used four-party service, a much larger sample would be required to insure that proper provision was made for the tendency toward use of this type of service in different parts of the city, for the inclusion of mainly residential subscribers since few business places use four-party service, and for the exclusion of those exchanges, if any, which do not provide four-party service. This example demonstrates further the importance of a previous statement, that the question of homogeneity of the universe depends upon the purpose for which the sample is taken.

Thus only a relative statement can be made concerning the size of a sample. If the events in the universe differ only with respect to the characteristic which is tested by the sample, a sample as small as one-tenth of 1 per cent of the universe may be adequate for the principle of statistical regularity to be effective. As the number of characteristics which vary in the universe increases, the size of the sample must be increased, sometimes becoming as large as 10 per cent of the universe. If a sample greater than 10 per cent is required to reproduce the characteristics of the universe, the universe itself is probably not sufficiently homogeneous for the principles of sampling to be used.

Methods of Securing a Representative Sample

The second problem is how to secure a representative sample. In particular, What cases shall be included in order to set up in the sample a pattern which will reproduce on a smaller scale the conditions of the universe?
In some cases none of the conditions of the universe may be known, while in other instances information is already available concerning the distribution of certain characteristics. Consequently there are two methods of securing representativeness: (1) uncontrolled sampling and (2) controlled sampling.

Uncontrolled sampling. If little or nothing is known about the distribution of any of the characteristics in the universe, an uncontrolled sample is the only one which can be used.

Example 1: Suppose a tobacco retailer wishes to make a consumer investigation of the question, "What brand of cigarettes is most popular in this city?" It would be difficult to find a "control" in this case, because nothing is known regarding the characteristics of cigarette smokers as a group of the population. It is known in general that children do not smoke cigarettes, but just what the proportion of cigarette smokers is in each adult age group would be hard to estimate. It is not even known how the percentage of men cigarette smokers compares with the percentage of women smokers. If there had been some recent nation-wide study showing what percentage of each sex smokes cigarettes, these two percentages would provide a control to determine the proportionate number of men and women from whom replies should be obtained in this study.

In the absence of any such control, the most obvious method is to take cases from the universe as they come to hand, making no choice of any kind. Even this involves some selection of time and place for taking interviews. The method must insure a rough representation of all the general characteristics of the adult population, on the assumption that, provided the sample is large enough, smokers of various brands of cigarettes will be included in correct proportion.
That is, since age, sex, nationality, economic class, or other characteristics may be determining factors in the choice of brand of cigarettes, the answers must come from persons who are representative of the total adult population in as many respects as possible. It would not serve the purpose, therefore, to distribute questionnaires only at women's clubs, or only at an industrial plant, or to interview only relief clients, or only people on the street at three o'clock in the afternoon. But if a busy downtown corner were selected, at an hour when all classes of men and women, employed as well as unemployed, are likely to be on the street, the passers-by should be fairly representative of the entire adult population. If stopped and asked what brand of cigarette they buy, some would reply, others would ignore the question; some would be cigarette smokers, others would not smoke cigarettes or would not smoke at all. The investigation might show that six hundred and thirteen cigarette smokers replied and the answers were tabulated as follows: Brand "A," 18 per cent; Brand "B," 16 per cent; and so on.

The important feature of this method is the absence of control of the sample. Experiments have shown that reliable results can be obtained by this method only through the use of a large sample. The distribution of none of the characteristics in the universe is known; therefore a large sample must be taken to insure that the pattern of the universe will be reproduced. This uncontrolled plan of collection is known also as the extensive method of sampling.
Example 2: The use of this method in an investigation is illustrated by the chain store inquiry of the Federal Trade Commission conducted in 1928 and published in 1930-31. The sample was obtained in the following manner:

A mailing list of chain stores was prepared by the commission from various lists of chains, including those of the Chain Store Age and the National Association of Real Estate Boards, supplemented by telephone directories, trade journals, and city directories, all of which were checked to eliminate, so far as possible, duplications. When completed, this mailing list for the selected groups of chain stores included slightly over 7,500 names. The results obtained from this mailing list are shown in the following tabular statement:

| Item | Number |
|---|---|
| Schedules mailed | 7,515 |
| Returned by post office | 713 |
| Duplications | 638 |
| Non-chain establishments only | 1,282 |
| Co-operative group only | 39 |
| Reported out of business | 492 |
| In receivership, no records, or records destroyed, etc. | 833 |
| Unobtainable at time of tabulation | 1,596 |
| Total eliminated | 5,593 |
| Schedules returned | 1,922 [6] |

Only 1,727 of the 1,922 schedules were usable in the analysis but the Commission appraised their representativeness as follows:

Comparing the commission's data with estimates for the entire field based upon census data, it appears that the commission's study represents approximately one-half of the number of stores operated and one-half of the aggregate sales volume of all organizations engaged in chain-store merchandising in 1929 in the 26 kinds of business covered by this inquiry, including chains of two and three stores, which are not classed by the census as chain stores. On the other hand, the total number of chains represented in the commission's inquiry is estimated to be something under 10 per cent of the total. [7]

[6] "Scope of the Chain-Store Inquiry," Chain Stores, 72d Congress, 1st Session, Senate Document No. 31, p. 9.

[7] Ibid., p. ix.
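The Commission's tabular statement can be checked with a few lines of arithmetic. The following is a small sketch in Python; the variable names are illustrative assumptions, and the two percentages at the end are computed here from the quoted figures rather than taken from the Commission's report.

```python
# Figures from the Commission's tabular statement quoted above.
SCHEDULES_MAILED = 7515
ELIMINATED = {
    "Returned by post office": 713,
    "Duplications": 638,
    "Non-chain establishments only": 1282,
    "Co-operative group only": 39,
    "Reported out of business": 492,
    "In receivership, no records, or records destroyed, etc.": 833,
    "Unobtainable at time of tabulation": 1596,
}
USABLE_SCHEDULES = 1727  # stated in the text following the table

total_eliminated = sum(ELIMINATED.values())
schedules_returned = SCHEDULES_MAILED - total_eliminated

# Shares of the original mailing list (derived here, not quoted from the source).
returned_pct = round(100 * schedules_returned / SCHEDULES_MAILED, 1)
usable_pct = round(100 * USABLE_SCHEDULES / SCHEDULES_MAILED, 1)

print(f"Total eliminated:   {total_eliminated:,}")
print(f"Schedules returned: {schedules_returned:,} ({returned_pct} per cent of mailing)")
print(f"Usable schedules:   {USABLE_SCHEDULES:,} ({usable_pct} per cent of mailing)")
```

The totals agree with the statement: the seven classes of eliminated schedules sum to 5,593, leaving 1,922 returned. Roughly a quarter of the mailing list came back, which illustrates how little the investigator controls the final size of an uncontrolled sample.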
The Commission had to treat its data in different classifications, hence the real problem of representativeness arose in the sub-groups. A comparison of sample data with Census of Distribution data based on the parts of each which were considered comparable is shown in Table 9.

TABLE 9
PERCENTAGE OF TOTAL CENSUS CHAINS (FOUR STORES AND UP), STORES, AND SALES IN 1929 REPRESENTED IN THE COMMISSION'S ORIGINAL CHAIN-STORE SCHEDULE RETURNS FOR CHAINS OF SIX STORES AND UP FOR 1928*

Per cent representation of census in commission's sample:

| Kind of chain | Chains | Stores | Sales |
|---|---|---|---|
| Food | 23.8 | 76.9 | 76.5 |
| Drug | 17.7 | 42.8 | 55.5 |
| Tobacco | 17.2 | 109.8 | 104.0 |
| Variety | 44.9 | 80.4 | 89.8 |
| Clothing, furnishing, and accessories | 20.8 | 30.6 | 34.7 |
| Hats, caps, and millinery | 20.6 | 33.7 | 52.9 |
| Shoe | 34.4 | 55.6 | 56.8 |
| Department store and dry goods | 28.7 | 53.5 | 97.8 |
| General merchandise | 6.4 | 9.3 | 7.8 |
| Furniture | 4.0 | 7.6 | 11.9 |
| Musical instruments | 12.1 | 15.7 | 24.8 |
| Hardware | 13.3 | 25.3 | 22.0 |
| Total | 21.8 | 66.3 | 69.2 |

* "Scope of the Chain-Store Inquiry," Chain Stores, 72d Congress, 1st Session, Senate Document No. 31, p. 28.

Table 9 is complicated by the fact that the ratios in the three columns have in the numerator results from the Federal Trade Commission's sample of chains operating six or more stores in 1928 and in the denominator results from the Census of Distribution, a complete enumeration of chains operating four or more stores in 1929. From this comparison the Commission concluded:

The purpose of Table [9] obviously is not to present an exact measure of the proportions of the commission's data either of the chain-store field as a whole or by specific commodities but rather to afford a general impression as to the kinds of business in which the commission data may be regarded as sufficiently comprehensive as contrasted perhaps with those for which the figures should be regarded merely as indicative because of the comparatively small representation in comparison with the census totals.
It should, of course, be recognized that the foregoing proportionate comparisons are approximations, both because of the variations in classification and the necessary treatment of the 4- and 5-store chains in the commission data. In general, it appears that with the possible exception of general stores and furniture chains, the commission's reports are sufficiently adequate to provide a satisfactory indication of chain store operations in the several kinds of business considered. 8

This sample was obtained without exercising any control over the cases which should be included. As a result only part of the organizations to whom the questions were sent proved to come within the definition of chain merchandisers. The ultimate size of the sample was unknown until the returns had been edited. Even when the size of the sample as a whole was known the representativeness of the sample with respect to different kinds of chains was in doubt until a partial comparison could be made when the results of the 1929 Census of Distribution were published. Finally it turned out that in so far as the comparisons with census results were valid, the various lines of trade were not equally well represented in the sample, although with two exceptions the sample was considered large enough to provide representative information concerning chain-store merchandising in different lines of trade.

Controlled sampling. When knowledge of some of the characteristics of the universe can be obtained, the usual practice is to take a controlled sample. A controlled sample is one in which representativeness is obtained by conscious adjustment of the sample to conform to the conditions existing in the universe according to one or more known characteristics. The known characteristics are not the ones that are being studied in the sample investigation.
For instance, in a survey of buying habits of students at a certain university, the number registered in each class is a matter of record at the registrar's office. This known distribution can be used in selecting a representative number of sample cases from each class, and in order to check with this control each student interviewed must be asked what class he or she is in. However, the object of the investigation is not to determine the distribution of students by class, but rather to assemble, by means of sampling, a variety of hitherto unrecorded information regarding the buying habits of all the students.

The advantages of the controlled method lie (1) in the substitution of a known representativeness for one hoped for on the basis of size of sample alone and (2) in the small size of sample which can be used.

8 Ibid., pp. 28, 30.

If the information had been available for the chain-store inquiry to have followed this plan of sampling, the first step would have been a study of existing information concerning chain stores to obtain for each line of trade covered by the investigation the best estimate of the number of chains, the proportion of large and small chains, the distribution of sales by lines of trade, and the total sales. Guided by this information the sample could have been planned so as to get the proper representation of large and small chains and of the several lines of trade. Thus a reasonable amount of control over the sample would have been exercised.

Controlled sampling may be further differentiated according to the degree of selection used in establishing the controls. If there is 100 per cent selection, leaving no elements to chance in determining the actual cases that appear in the sample, the method is called selective sampling. If a certain degree of selection is used, but the final determination of actual cases is left to chance, this is called the inclusive method.
Examples of the latter will therefore cover the entire range between uncontrolled, or extensive, sampling in which the distribution of none of the characteristics of the universe is known, and selective sampling, in which each individual case is picked because of the known representativeness of its general characteristics.

The selective method: An example 9 will show the method by which the investigator, on the basis of his knowledge of the universe, hand-picks a small number of cases which he believes will be a representative sample.

9 Lyman Chalkley, Jr., "The Flow of Sales through Retail Drug Stores: A Factual Study," Harvard Business Review, Vol. XII, No. 4 (July, 1934), pp. 427-29.

Five years ago Mr. William Groom of the Thompson-Koch Company was interested in the possibility of measuring directly in terms of sales the effectiveness of the advertising produced by his agency. Mr. Groom selected four middle-western cities of 35,000 to 50,000 population. He planned to run his experimental campaigns in the newspapers in these cities and to measure results in terms of the sales made through the local drug stores. To this end Mr. Groom enrolled from three to a dozen or more drug stores in each of his cities and paid them to submit each month a statement of the sales made during the month of each of a number of different drug items.

In order to be able to generalize from the results, it is, of course, desirable to make any study of retail sales in communities which are more or less representative. Mr. Groom originally chose his four test towns on the basis of a personal knowledge of the communities, and a belief that these communities were fairly representative of a great part of the country. The thought in using cities of this size and character was that they presented a mixture of urban and rural people and problems and, therefore, were more
representative of the whole country than either the larger metropolitan centers or the purely rural districts. Each city has some manufacturing, some farming, and some general business and professional activities proportioned roughly like those in the country as a whole.

Although only four cities were included in the study, the investigators expected their results to be representative of the country as a whole. Further, the records of 23 drugstores out of a total of about 100 in the four cities were used. These 23 were personally selected by Mr. Groom. Finally the sales of 12 selected items became the data of the study. The selection, in every respect, of the cases to be included in the study is the important point of the example.

This description of the procedure immediately gives rise to three questions: (1) Were these four towns representative of the country as a whole as regards the relation of sales to advertising? (2) Were the sales of the 23 stores representative of sales of all drugstores in the four communities? (3) Were the 12 items the proper ones to study?

The equivalent of these questions must be raised with respect to the plan for any selective sample. When the investigator feels that he has sufficient knowledge of his universe and of his sample to be able to answer such questions, there is some justification for the use of the selective method of sampling. Under the usual practical conditions no such assurance is possible, hence the method should be used sparingly. The danger lies in getting a biased result if the selection should go astray at any stage of planning the sample.

The inclusive method: Two examples will illustrate various degrees of control in securing an inclusive controlled sample.

Example 1: An advertising firm was asked to make a quick survey of the nation-wide popularity of a certain brand of scouring powder. The firm first selected a few cities which were believed to be representative of the entire country.
Local supervisors were appointed, each of whom was familiar with conditions in her own city, and they were asked to select sample areas that would be representative of nationality groups, old and new residential districts, etc., but each containing a variety of income levels. One agent was assigned to each area and was given freedom to select the housewives whom she would interview, except that her total number of interviews must be divided approximately into 20, 40, 30, and 10 per cent of four roughly defined economic classes. (In certain areas adjustments of the required proportions were made according to the known economic levels in the neighborhood.)

If toward the end of an assignment an agent found that she had secured a markedly unbalanced proportion of interviews according to economic class, she then had to exercise some degree of selection in choosing the blocks and houses to visit so that her remaining schedules would make up the deficiency. For the most part, however, if the agents used some system of random selection such as calling at every third house or canvassing every other block, they found that they usually had the right proportion of interviews without making any conscious adjustment.

This method permitted some degree of selection according to certain general characteristics within each of three controls of the investigation (city, area, and economic group). In spite of this fact, most of the housewives interviewed were chosen solely by chance: they happened to be at home; they happened to live in a block near the car line where the agent started to canvass, etc., all of which were factors that had no effect whatever on their choice of scouring powder. The key idea in this method was that all of the types of families were given a chance to appear in the sample in proportion to the number of each type existing in the community.
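The quota rule in Example 1 can be sketched in a few lines. This is only an illustration: the function names, the class labels, the figure of 50 interviews per agent, and the counts of completed interviews are all hypothetical; only the 20, 40, 30, and 10 per cent split comes from the text.

```python
def interview_quotas(total_interviews, class_percentages):
    """Divide an agent's interviews among economic classes by whole percentages.

    Fractions of an interview are settled by giving the spare interviews
    to the classes with the largest remainders.
    """
    quotas = {}
    remainders = {}
    for name, pct in class_percentages.items():
        quotas[name], remainders[name] = divmod(total_interviews * pct, 100)
    spare = total_interviews - sum(quotas.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:spare]:
        quotas[name] += 1
    return quotas


def remaining_deficit(quotas, completed):
    """Interviews still needed in each class to make up the quota."""
    return {
        name: max(quota - completed.get(name, 0), 0)
        for name, quota in quotas.items()
    }


# Hypothetical labels for the four roughly defined economic classes.
PERCENTAGES = {"class 1": 20, "class 2": 40, "class 3": 30, "class 4": 10}

quotas = interview_quotas(50, PERCENTAGES)
completed = {"class 1": 12, "class 2": 18, "class 3": 10, "class 4": 5}
deficit = remaining_deficit(quotas, completed)

print(quotas)
print(deficit)
```

An agent who reached the end of her assignment with a deficit in some class would direct her remaining calls toward blocks where that class is likely to be found, which is the only point at which conscious selection enters.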
All of the characteristics of the universe were given a fair chance to be included in the sample. Beyond that point no selection was made of the individual cases that were actually taken.

Example 2: In this investigation 10 by inclusive sampling, selection was exercised only at the first level of the plan, and at later stages the returns were determined wholly by chance.

An investigation was made by the Bureau of Business Research [of the University of Pittsburgh] in the spring of 1931 to determine the cost and the quality of housing accommodations secured by salaried workers employed in downtown Pittsburgh. The housing status of 1,415 families was analyzed.

The data for this study were secured by means of questionnaires distributed to salaried workers through their employers. The co-operation of the following types of concerns with offices in downtown Pittsburgh was secured for the distribution of the questionnaires: two public utilities, four department stores, five financial institutions, five industrial concerns, one railroad, and two insurance agencies.

It is believed that the employees of these concerns represent a fair cross-section of the salaried workers employed in downtown Pittsburgh.

10 Theodore A. Veenstra, "Housing Status of Salaried Workers Employed in Pittsburgh," University of Pittsburgh Bulletin, Vol. XXVIII (June 10, 1932), pp. 1-4.

Questionnaires were distributed among salesmen, accountants, clerks, statisticians, engineers, and junior executives. In order to get replies from the type of worker selected for the study, co-operating concerns were asked to distribute the questionnaires to employees described as follows: "Heads of families engaged in clerical and executive work in downtown Pittsburgh with salaries of $5,000 annually or less." In general the persons reporting were heads of families engaged in the designated types of work.
Salaries in a number of cases were in excess of $5,000; but such cases, if otherwise acceptable, were included in the study.

A large proportion of the 1,385 persons reporting occupations were engaged in clerical work. Those having executive, technical, selling, and accounting positions were next in order in numbers reporting. Other groups were only sparsely represented.

The universe from which this sample was taken included only salaried workers in offices in downtown Pittsburgh who were heads of families. The $5,000 limit automatically excluded high-salaried executives. Thus the limits of the investigation were rather closely defined. The 19 firms were selected because they represented the known distribution of different lines of business in which the desired kinds of workers were employed. It was assumed that the employees of these concerns were representative of the universe as to types of work and salary distribution. Undoubtedly many failed to reply, but, since the original group was so carefully selected, the reply of any employee was just as acceptable as that of any other. As long as a sufficient number of replies was secured the information regarding housing could be considered as representing the entire group.

The weighted method: The weighted method of controlled sampling is, in its initial steps, the same as inclusive sampling. That is, a definite effort is made to secure cases in the sample that will represent the known characteristics of the universe. As a further step, however, the sample is again consciously adjusted after it has been collected in order to bring it into closer conformity with these known general characteristics. In this process none of the cases is dropped from the sample, but all are grouped and weighted in order to give each group the importance that previous knowledge of the universe indicates it should have.
This method was used in predicting the results of the presidential election in 1936, a detailed description of which is provided in an explanation of the work of the American Institute of Public Opinion (Gallup Polls).

The weighted-sample technique assumes that it is possible to isolate and measure factors, or groupings, which determine the distribution of the variable in question. The method of the weighted sample tries to choose from the many possibilities the important determinants of voting behavior. The sample is then constructed by preserving in the miniature population the ratios of the selected groupings which hold for the total population.

The problem of the selection of the significant groupings in the voting population was solved in this poll by experimentation. Gallup tried distributing straw-vote returns according to various factors. Those which showed an even distribution of ballots between the major candidates were discarded. Five controls were finally chosen. First, ballots returned from each state were to represent the correct proportion of the state's population to the national population. Second, the ratio of farm and city votes in each state was to be maintained. Third, the correct percentage of voters in each income group had to be represented. Fourth, the ballots returned were to reflect accurately the proportion of young people who had come of voting age since the last election. Fifth, the return was to come from the correct percentage of people who voted for Roosevelt, Hoover, Thomas and others in 1932.

The distribution of ballots in the proper proportion is, however, only half the story. Ballots leave the polling office in the proper ratio according to the factors mentioned. But the correct ratio is seldom maintained after the round trip from office to voter to office. As a rule less than one-fifth of the mailed ballots are returned and these tend to come from selected groups.
People with intense opinions (reformers, arch-conservatives, radicals) are more likely to return ballots than those who are luke-warm or undecided; more highly educated and economically secure persons take a greater interest in the ballots and feel more free to answer them. The American Institute found that the largest response (about 40 per cent) came from people listed in Who's Who. Eighteen per cent of the people in telephone lists, 15 per cent of the registered voters in poor areas, and 11 per cent of people on relief returned their ballots. Men are more likely to reply than women.

These peculiarities in the mail response of the sampled population are counteracted in two ways: using interviewers, and adjusting the final number of ballots according to the original quota-controls. The Institute had some 200 interviewers scattered throughout the nation. The answers they gathered constituted one-third of the final return for the Institute poll. Interviewers can be used advantageously where the mail ballot is not likely to succeed: in relief districts, farms, and working class areas. 11

11 Daniel Katz and Hadley Cantril, "Public Opinion Polls," Sociometry, Vol. I (1937), pp. 159-60.

A partial description of the method by which the returns were adjusted according to the control groups may serve as a guide to the general application of this method.

The criteria used as controls are established from known data such as United States population by states, farm and non-farm population of each state, age groups of the population, and the 1932 election returns by states. The proportionate distribution of replies received is compared with the distribution in the "control" group for each of these criteria successively in order to secure for the votes cast in the straw ballot a redistribution that will be in every essential truly representative of the total voting population. The adjustments according to four of the controls are made by states.
As an example of the method of procedure, suppose that after the special interviews the straw ballots received from New York State were distributed as in Table 10, according to farm and non-farm voters.

TABLE 10
STRAW BALLOTS IN NEW YORK STATE, 1936
ORIGINAL RETURNS FROM FARM AND NON-FARM VOTERS (HYPOTHETICAL DATA)

Choice of candidate:

| Voters | Roosevelt | Landon | Thomas | Total |
|---|---|---|---|---|
| Farm: number | 2,500 | 1,975 | 25 | 4,500 |
| Farm: percentage distribution | 55.6 | 43.9 | 0.5 | 100.0 |
| Non-farm: number | 75,000 | 25,000 | 2,000 | 102,000 |
| Non-farm: percentage distribution | 73.5 | 24.5 | 2.0 | 100.0 |
| Total number | 77,500 | 26,975 | 2,025 | 106,500 |

The first question is whether the proportions of farm and non-farm voters conform to the census distribution. Of the total 106,500 straw ballots cast, 102,000, or 96 per cent, were by non-farm voters. According to the census, however, the population of New York State is 94 per cent non-farm and 6 per cent farm. If the 106,500 straw ballots had been divided in that proportion, there would have been 6,390 farm instead of 4,500, and 100,110 non-farm instead of 102,000. These corrected figures are therefore substituted for the total ballots cast. They are then redistributed as to choice of candidate according to the same percentage distribution that was found from the actual ballots. That is, the percentage distributions shown in Table 10 are applied to the new totals for farm and non-farm giving the choices for each candidate by the farm and non-farm voters as shown in Table 11. The corrected state totals for each candidate are obtained by adding the corrected farm and non-farm votes in each case. It will be noted that the grand total for the state, 106,500, has not been altered.
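The arithmetic of this adjustment can be sketched as follows. The function and variable names are hypothetical. The percentage distributions are taken as 55.6, 43.9, and 0.5 for farm voters and 73.5, 24.5, and 2.0 for non-farm voters; these are the values consistent with the ballot counts of Table 10 and should be read as an assumption wherever the printed figures are unclear.

```python
def reweight_ballots(ballot_total, census_pct, choice_pct):
    """Redistribute straw ballots so each group has its census share.

    census_pct: per cent of the population in each group (e.g. farm).
    choice_pct: per cent of each group's ballots cast for each candidate.
    Returns the corrected group totals and the corrected ballot counts.
    """
    group_totals = {
        group: round(ballot_total * pct / 100)
        for group, pct in census_pct.items()
    }
    corrected = {
        group: {
            candidate: round(group_totals[group] * pct / 100)
            for candidate, pct in choice_pct[group].items()
        }
        for group in census_pct
    }
    return group_totals, corrected


CENSUS_PCT = {"farm": 6, "non-farm": 94}
CHOICE_PCT = {  # percentage distributions assumed from Table 10
    "farm": {"Roosevelt": 55.6, "Landon": 43.9, "Thomas": 0.5},
    "non-farm": {"Roosevelt": 73.5, "Landon": 24.5, "Thomas": 2.0},
}

group_totals, corrected = reweight_ballots(106_500, CENSUS_PCT, CHOICE_PCT)

# Corrected state totals: add the farm and non-farm votes for each candidate.
state_totals = {
    candidate: sum(corrected[group][candidate] for group in corrected)
    for candidate in CHOICE_PCT["farm"]
}

print(group_totals)
print(corrected)
print(state_totals, sum(state_totals.values()))
```

Under these assumptions the farm and non-farm totals come to 6,390 and 100,110, and the state total remains 106,500, as the text requires. Only the internal distribution of the ballots is changed.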
TABLE 11
STRAW BALLOTS IN NEW YORK STATE, 1936
CORRECTED FOR FARM AND NON-FARM POPULATION; DATA FROM TABLE 10

Choice of candidate:

| Voters | Roosevelt | Landon | Thomas | Total |
|---|---|---|---|---|
| Farm: number | 3,553 | 2,805 | 32 | 6,390 |
| Farm: percentage distribution | 55.6 | 43.9 | 0.5 | 100.0 |
| Non-farm: number | 73,581 | 24,527 | 2,002 | 100,110 |
| Non-farm: percentage distribution | 73.5 | 24.5 | 2.0 | 100.0 |
| Total number | 77,134 | 27,332 | 2,034 | 106,500 |

The same process can now be repeated for the other three controls within each state starting with the original returns for each of the three. The four totals thus arrived at according to the four criteria can then be averaged to give the true representation for New York State.

After similar adjustments have been made for each state the results for the 48 states are ready to be combined according to the fifth criterion, the proportion of each state's population to the United States total. In this last step the number of ballots cast in each state is adjusted but the actual total number of ballots cast in the United States remains unchanged.

Through the introduction of these five independent controls, the internal distribution of cases in the sample might be considerably altered, but such alteration is made in order to adjust discrepancies between the collected information and the known occurrence of the five control characteristics in the universe. As a result of these alterations the cases are distributed in the sample so that the unknown variable characteristic, choice of candidate in 1936, can be studied without distortion arising from failure of one or more of the known characteristics to be represented properly.

Summary. Before closing this discussion it should be pointed out that the three methods of obtaining a representative sample depend upon the principle of statistical regularity in different ways. (1) In an extensive sample the principle has its purest application when the appearance of the characteristics of the universe in the sample is left entirely to chance.
(2) In a selective sample the investigator's knowledge of certain characteristics of both universe and sample is substituted for random choices. He is still depending on the sample to give him information regarding certain unknown characteristics of that universe. (3) In an inclusive sample, either unweighted or weighted, the known characteristics of the universe are definitely projected into the conditions of the sample, but the appearance of unknown or uncontrolled characteristics is left to chance through random selection of the actual cases.

None of these methods can be used in automatic fashion. Careful planning by the investigator will always be needed. His two most valuable assets will be experience and the exercise of good judgment.

PROBLEMS

1. a) Would the members of your statistics class be a representative sample of the students of your school as to height? weight? age? grades? hair color? eye color? Discuss.
   b) Would the members of the class be a representative sample of all college students with respect to the characteristics listed? Discuss.
2. If a large number of samples, each including 400 cases, show an average variability of 4 per cent from a known result, how large a sample would be required to confine the variability to 1 per cent? to 3 per cent? to 8 per cent? to .5 per cent?
3. Why is there less precision in the results of the dice example (Table 7) than in the coin example (Table 6)?
4. A retail gasoline station proprietor wished to obtain from his customers the following four types of information: the average mileage of cars per gallon of gasoline, the name of the manufacturer of the tires on the cars, the place of residence of the customers, the proportion of customers using premium gasoline.
One of these types of information could be obtained only by the census method, one by a relatively small sample, one by a relatively large sample, and representative information on one of them probably could not be obtained either by sampling or census. Identify the four types of information according to the preceding description.
5. Basing your answer on the quoted paragraph at the bottom of page 80, and Table 9, discuss the question of whether the Federal Trade Commission's gross sample was representative of large and small chain stores.
6. What are the essential differences between uncontrolled sampling and controlled sampling? Between extensive sampling, selective sampling, inclusive sampling, and weighted sampling?
7. Given the following information concerning the 25,900 farms in five counties in 1940.

| County | Owners: sell milk | Owners: do not sell milk | Tenants: sell milk | Tenants: do not sell milk |
|---|---|---|---|---|
| A | 243 | 608 | 110 | 542 |
| B | 437 | 219 | 633 | 461 |
| C | 1,190 | 2,166 | 2,240 | 1,892 |
| D | 946 | 239 | 1,412 | 928 |
| E | 1,762 | 2,913 | 2,812 | 4,147 |
| Total | 4,578 | 6,145 | 7,207 | 7,970 |

   a) Set up the distribution of a sample of 400 cases to be collected from these counties by agents to obtain information concerning "the number and breed of cows used in dairy herds of farmers selling milk, and the average daily production of milk per cow."
   b) Set up the distribution of a sample of 400 cases collected by agents to obtain information concerning "the difference in living standards, if any, between farmers who sell milk and those who do not."
   c) Is the sample of 400 large enough in each of the preceding investigations, i.e., is the number of cases in the sub-groups sufficient to provide for proper operation of the principle of statistical regularity? Could the sample contain less than 400 cases? Discuss.
8. Suppose that an investigation by the sampling method were to be made of the extent of employment, unemployment, and part-time employment in a city of 500,000 population.
The committee in charge would have to consider the following methods:
   A. Using schedules in the hands of agents
      1. Visit one house on each side of the street in each block of the city
      2. Select in advance representative blocks in the city and visit each house in those blocks
      3. Start from a common point with areas whose boundaries run out from the center like spokes in a wheel and instruct the agents to proceed within their areas until
         a) They have secured 1,000 completed schedules
         b) They have visited 1,000 houses
      4. Get names and addresses of unemployed from the local welfare bureau and names and addresses of employed from 20 leading employers. Visit the former to obtain data on unemployment and the latter to obtain data on employment and part-time employment.
   B. Using mail questionnaires
      1. Address the occupant at number 13 of every series of 100 street numbers, for example, send a questionnaire to the occupant at 13 Englewood Ave., 113 Englewood, 213 Englewood, etc.
      2. Address every tenth person in the telephone directory
      3. Address every twentieth person in the city directory
      4. Address the first 50 persons in each letter of the alphabet in the telephone directory.
   a) Discuss the likelihood of securing a representative sample by each method, (1) in the "A" group, (2) in the "B" group.
   b) Rate the four methods of each group according to the kind and degree of control exercised in the sample.
   c) How would you obtain this information by a completely uncontrolled sample?

REFERENCES

BROWN, LYNDON O., Market Research and Analysis. New York: The Ronald Press Co., 1937. Chapter 10 deals with the planning of a sample.

HARPER, F. H., Elements of Practical Statistics. New York: The Macmillan Co., 1930. The presentation of principles and methods of sampling in chapter I is very stimulating, particularly the discussion of weighted sampling.

Standards of Research. Des Moines, Iowa: Meredith Publishing Co., 1929.
Pages 27 and 28 give a specific statement of the policy followed in distributing questionnaires.

YULE, G. UDNY, and KENDALL, M. G., Introduction to the Theory of Statistics. London: Charles Griffin and Co., Ltd., 1937. Chapter 18 contains an unexcelled statement of the theoretical background of sampling.

CHAPTER VI

COLLECTION OF DATA: DIRECT SOURCES

DESCRIPTION OF DIRECT SOURCES

In Chapter IV direct sources were defined as the business concerns, government and private agencies, and individuals from whom statistical information not otherwise available could be secured by direct appeal. The type of source to which appeal will be made depends upon the kind of information desired.

The internal records of business concerns are the original sources of such data as sales, profits, costs of doing business, employment, wages, and prices. Most of the information is private and can be obtained only on a confidential basis, but business concerns have come to realize the advantage to themselves and to the public of supplying the information for statistical purposes, provided the use made of it is not detrimental. Hence a large amount of valuable information may be obtained directly from business concerns.

The next sources from which information can be obtained are government and private agencies. Government agencies here refers not so much to those engaged in collecting and publishing statistical information as to those directly concerned with the control or regulation of business. Examples are the Board of Governors of the Federal Reserve System, the Federal Trade Commission, and the state public utility commissions. Agencies such as these are in a position to supply a great amount of information on special subjects in addition to that which they publish. Private agencies include trade associations, labor organizations, industrial institutes, charitable organizations, statistical services, research bureaus, and co-operative groups.
In many cases these agencies are more valuable sources than business concerns. This is particularly true when information for an industry or an area is desired rather than for individual firms.

Finally, the statistician who is interested in data relative to consumption habits must expect to get his information from individuals or family groups. This is perhaps the most difficult source from which to obtain data because of the large number of persons who must be canvassed in order to get enough data for statistical purposes, and because individuals so frequently do not possess the desired information or are unable to give it accurately even though it concerns themselves.

There is considerable difference in the actual collection process from direct sources according to whether or not the data exist in the files of the informant in such a form that they can be transferred to collection blanks. Collection from business firms and similar organizations quite commonly means merely transferring data from the records. On the other hand collection from individuals may require a lengthy process of interviewing to secure the information wanted.

COLLECTING DATA FROM DIRECT SOURCES

Once it has been determined that the collection of data must be made from direct sources, and the preliminary decisions concerning census or sample and the use of agents or mail questionnaires have been made, the investigator is ready to put his plan into operation. The work follows a natural sequence of steps regardless of the size of the investigation. The several steps are: (1) the provision for physical equipment; (2) a preliminary study of the field; (3) the choice of cases for the sample; (4) preparation of agents' schedules or mail questionnaires; (5) the selection and training of a staff; (6) supervising the work of collection.

There is literally no end to the amount of detail which might be introduced in discussing these steps.
The intention is to present no more explanation than is necessary to give a broad view of the work. There is a wealth of reference material available from which more detailed information can be obtained.

The Provision of Physical Equipment

An office must be set up as a headquarters for the investigation. In some cases only pencil and paper and a place to work are needed. In general, however, equipment for filing, tabulation and calculation, typewriters, forms for recording the progress of the work, and similar materials must be provided.

A Preliminary Study of the Field

No matter how carefully the general plan of an investigation has been developed there will be certain peculiarities which need to be discovered and provided for before starting the actual collection of data. A preliminary study may bring them to light and at the same time pave the way for the regular work. If there are technical terms used in an industry, these should be known in advance. A knowledge of the form in which records are kept and the units in which data are recorded will aid in phrasing questions. The advice of leading firms or agencies will be useful in showing the proper method of approach to others who are to be canvassed. This advice will be particularly valuable if there are some concerns that are difficult to approach.

A common practice is to test a preliminary draft of questions by submitting them to a small sample of those from whom the information is to be obtained. The knowledge acquired in this way will aid in preparing the final draft of the questions, provide the background for improved agent technique, and create advance good-will for the investigation.

The Choice of Cases for the Sample

The method of selecting the cases to be included in a sample is one of the most vital steps in the entire collection process. For that reason all of Chapter V is devoted to an explanation of the principles of sampling and the methods of choosing samples.
The exact plan to be followed in selecting cases must be worked out in advance by the director and the importance of conformity to the plan must be impressed upon everyone connected with the investigation. If any subsequent change in the plan of sampling becomes necessary, such change should be made only with the knowledge of the director. For example, if a field agent in a consumer survey has been assigned a particular family, and finds it unwilling to give information, he should not try the house next door but should obtain a new assignment from the office.

Preparation of Schedules and Questionnaires

The success of an investigation depends to a large extent upon the quality of the questions used. There will be considerable difference in the type of question included depending upon whether schedules in the hands of agents or mail questionnaires are employed. Agents can generally secure replies to questions which are more involved and more personal than those on mail questionnaires. In spite of this difference, the two types of lists of questions can best be discussed together with separate explanations of the points that refer to one and not the other.

There are four things to be considered in preparing a schedule or questionnaire: (1) content, (2) wording of the questions, (3) definitions, and (4) form.

Content. In outlining the content of a schedule or questionnaire, the guiding principle is unity. The questions must be determined in terms of the objective of the investigation. Only those questions should be included which contribute directly or collaterally to the objective. Further, the questions must be so planned that the replies can be tabulated to yield answers to the questions proposed at the outset of the study. This requires that careful consideration be given to the ultimate goal of the investigation.
The object of a WPA project in an eastern city was to study the extent of the repair and modernization work which might be anticipated in the city under Title I of the Federal Housing Act of 1934. The schedule of questions in Figure 2 was prepared for the study. For multiple family dwellings a schedule was to be filed for each dwelling unit in the building.

FIGURE 2
SCHEDULE USED IN A REAL ESTATE SURVEY

1. How many occupants?
2. How many rooms?
3. Basement?
4. Stories?
5. Single or double garage?
6. Electric refrigerator?
7. Rent?
8. When was house built?
9. Owner or renter?
10. How long has occupant lived in house?
11. Automobile?
12. Use auto for work?
13. How long to go to work?
14. How many in family are working?
15. What kind of heat?
16. Fuel used?
17. Single or double house?
18. Is house in good condition?
19. Who pays water rent?

If questions 5 and 11 disclosed the fact that the family had an automobile and no garage, or a single garage and two automobiles, presumably that family could be interested in garage construction. If question 15 showed that the house had no central heating system or had an antiquated system, perhaps the family would be interested in improved heating installation. If the answer to question 18 was a simple negative, further investigation of the house would be necessary in order to determine whether the deficiency was lack of paint, a rotting porch, a leaking roof, defective plumbing, or other needed repairs. Whatever the deficiency turned out to be, the family could presumably be interested in remedying it.

Some of the questions such as 6, 12, 13, 14, 16, and 19 are difficult to justify in this schedule; therefore in the revised form, Figure 3, they have been omitted. The proposed revision is designed to give more information concerning the repairs and modernization needed and to facilitate collection and tabulation.
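The reasoning applied above to questions 5, 11, 15, and 18 can be written out as a small screening rule. The following is only a sketch under stated assumptions: the record layout, the field names, and the function `prospects` are hypothetical, the automobile answer is assumed to be recorded as a count (Figure 2 asks only "Automobile?"), and an antiquated heating system is not modeled because the schedule gives no way to recognize one.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """One completed schedule, reduced to the four answers discussed in the text.

    Field names are hypothetical; they are not taken from the survey itself.
    """
    garage_spaces: int    # question 5: 0 = no garage, 1 = single, 2 = double
    automobiles: int      # question 11, assumed here to be recorded as a count
    central_heat: bool    # derived from question 15
    good_condition: bool  # question 18


def prospects(s: Schedule) -> list[str]:
    """Return the kinds of work the family could presumably be interested in."""
    leads = []
    # An automobile with no garage, or two automobiles with a single garage.
    if s.automobiles > s.garage_spaces:
        leads.append("garage construction")
    # No central heating system (antiquated systems are not modeled).
    if not s.central_heat:
        leads.append("heating installation")
    # A simple negative on question 18 calls for further investigation.
    if not s.good_condition:
        leads.append("repairs (follow-up inspection needed)")
    return leads
```

A rule of this kind also shows which questions carry no weight in the tabulation: any answer that never enters a test such as those above contributes nothing to the objective of the study.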
Agents using the revised schedule could save much time and effort because they would not be forced to ask any irrelevant questions and a better impression would be made on the informant.

FIGURE 3
PROPOSED REVISION OF REAL ESTATE SCHEDULE, FIGURE 2

HOUSE
(1) Address
(2) Stories: 1 2 3 4 B A
(3) Single / Double / Other
(4) Year built
(5) Garage: 1 2 3 or more

DWELLING UNIT
(6) Floor
(7) Years lived in by present occupant
(8) Owner / Renter
(9) Monthly rent
(10) No. of rooms; bath
(11) No. of occupants
(12) No. of automobiles owned
(13) Heating equipment, central heat: Yes / No
(14) Hot air / steam / hot water
(15) Any repairs needed: Yes / No

REPAIRS NEEDED
(16) House: Paint, Porch, Roof, Sidewalk, Driveway, Garage, Other
(17) Dwelling unit: Electric wiring, Plumbing, Heating system, Other

Wording of the Questions. When schedules in the hands of agents are used as in the real estate survey of the preceding section, it is not necessary to word questions in so much detail as in a mail questionnaire. Since the agents are already familiar with the meaning of each question and the definition of terms, the abbreviated form used in Figure 3 is better than the sentence form of questions in Figure 2. It is easier for the agent to check the answer wherever possible than to write several words, and the uniform marking greatly facilitates tabulation. Where there may be a variety of answers a space is left for writing. For example, the answer to number 6 would be "whole house" for a single family dwelling.

The wording of a mail questionnaire is analogous to the agent's conversation with the informants. Since the questionnaire must be filled out by the respondent himself, the questions must be complete sentences and must make their own appeal. Certain practices in wording questions have been found so effective as to have almost the force of rules.

The wording must be clear to the respondent: Each question should contain but one idea.
It must be stated as simply as possible so that there can be no doubt in the mind of the respondent what is wanted. Care should be exercised also to avoid the possibility of an ambiguous answer. For example, the following questions and answers are taken from an investigation made some years ago of the status of the Negro in industry.

1. In what jobs are both Negroes and whites employed? Answer: Common labor
2. In what jobs are only Negroes employed? Answer: Common labor
3. In what jobs are only whites employed? Answer: Foremen and Mechanics

These questions appear to be simple and straightforward. The investigator felt that there could be no doubt that they would be clear to employers to whom the questionnaire was sent. Yet impossible answers such as those recorded for questions 1 and 2 were received in a large number of the replies. Apparently the person filling out the questionnaire disregarded the word "only" in the second question. The maker of a questionnaire cannot expect that respondents will read questions as discriminatingly as was required in this case.

The work of the respondent must be kept to a minimum: There are several things to consider in complying with this rule. The use of a few easily answered questions in a questionnaire will increase the per cent of replies. If the answers can be given in a few minutes, the respondent is likely to fill them in immediately, whereas a list requiring more time may be laid aside and never picked up again. The number of replies is increased by the use of questions answered by "yes" or "no," or with easily obtained numerical answers or with a list of colors, qualities, places, etc., from which the respondent can check or underscore the applicable ones.

The respondent should not be asked to make computations. Hence the question, "What is your annual remuneration?" is not a good one to ask a laborer for two reasons.
(1) Not only is the word "remuneration" foreign to his vocabulary, but (2) he may be unable to state his earnings except by the day or week.

Requests for past information should be avoided if possible. The United States Department of Agriculture was not likely to obtain much usable information from the following request sent to farmers September 15, 1930:

1. Acres sown to wheat in the summer or fall of each year (columns headed 1929, 1928, 1927, 1926)
2. Acres sown to wheat in the spring of each year (columns headed 1930, 1929, 1928, 1927)
3. Acres of wheat harvested in the summer of each year
11. Actual cost of storing in local station elevator, per bushel, per month in each year

Make sure that no unnecessary repetition of information is requested. Two adverse results arise from failure to heed this warning: (1) The duplication adds to the length of the questionnaire and may be the cause of its being discarded. (2) The impression in the mind of the respondent created by the duplication is likely to be hostile to the point of causing him to discard the questionnaire. Examples of repetitious and overlapping questions are found in the following list selected from a questionnaire sent to state hospitals by a social agency:

6. To what extent does overcrowding express itself in unsuitable sleeping quarters?
11. Do you need additional employees?
15. Do you have adequate hospital and medical facilities?
19. Do you have adequate facilities for giving your inmates instructive work and recreation?
21. Are your facilities for academic work sufficient?
22. (c) Is your staff of teachers large enough?
28. Are inmates paroled whom you deem unfit to return to the community?
29. What are the outstanding needs of your institution?
    Bed capacity
    Medical equipment
    Employees
    Academic equipment
    Teachers
    Recreational facilities
    Opportunities for work
    Extended parole

The last question merely asks for information already covered by the preceding questions.
Questions 15 and 19 each combine two separate ideas. There are other faults in the wording of these questions which will be referred to later.

Form and content must not be offensive: A great amount of personal and business information can be obtained by the use of questionnaires, but great care must be exercised to avoid offense. One cannot ask the question, "What was the dollar value of your net sales last year?" But the approximate data may be secured by asking, "Please indicate in which of the broad groups below your net sales for last year would fall," followed by several sales classes arranged to give enough detail for use in the subsequent steps of the investigation. A question may not be personally offensive, but may involve official complications. In the example quoted in the preceding section, a hospital superintendent might well hesitate to answer question 28, fearing to give offense to the parole board and to politicians. A question stating or even implying moral turpitude should be avoided. Likewise questions dealing with religious principles or habits should be used with caution.

Bias must be avoided: Bias may enter in two ways. First, the question may be phrased so as to lead to a certain answer. An example of a biased wording is, "Did the fish cakes taste better to you than canned salmon, salted cod, or shredded cod?" It would be much better to list the four types of prepared fish and request that the user number them in the order of preference. Second, estimates that are based on opinions rather than on actual figures may be biased. Suppose you were inquiring of a manufacturer of drugs whether his product was distributed at retail mainly through chain stores or independent stores. His direct contacts with the buyers of chain retailers might lead him to suppose that they were his chief customers, whereas a study of the sales records might well show the reverse.
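The unbiased form of the fish question recommended above, in which the user numbers the four types of prepared fish in order of preference, yields replies that are easily tabulated. The following is a minimal sketch, assuming each reply is recorded as a mapping from product to rank with 1 as the first choice; the function name `mean_ranks` and the rule for rejecting incomplete or tied rankings are assumptions made for illustration, not part of the investigation described.

```python
ITEMS = ["fish cakes", "canned salmon", "salted cod", "shredded cod"]


def mean_ranks(replies):
    """Average the preference rank given to each item (1 = most preferred)."""
    if not replies:
        raise ValueError("no replies to tabulate")
    expected = list(range(1, len(ITEMS) + 1))
    totals = {item: 0 for item in ITEMS}
    for reply in replies:
        # Reject incomplete or tied rankings rather than guess at the intent.
        if set(reply) != set(ITEMS) or sorted(reply.values()) != expected:
            raise ValueError(f"unusable ranking: {reply!r}")
        for item, rank in reply.items():
            totals[item] += rank
    return {item: totals[item] / len(replies) for item in ITEMS}
```

A lower average indicates a more preferred product. Since every product is presented on equal terms, the wording no longer leads the respondent toward any one of them.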
Answers should be obtained in the most usable form: This is essentially a matter of visualizing the subsequent use of the data. In particular all units should be carefully selected and defined from the point of view of the subsequent analysis. When new information is to be used in conjunction with some already in hand be sure that the new information will be comparable with the old. It is also essential that the information be received in a form which facilitates tabulation and analysis. In most questionnaires having more than two or three questions cross-information becomes available by using the results of two or more questions together. It is important to plan the questions so as to develop the maximum amount of cross-information. Conversely, failure to consider this feature of the questionnaire may lead to a serious hiatus in the information which will be discovered too late to be remedied.

Figure 4 shows the advantages of foreseeing the subsequent parts of the work when making the questionnaire. The three blank spaces marked "Leave blank for Department use" permit the tabulation on this form of the number of spindles in the mill, the number being currently operated, and the number belted and ready to operate (active spindles as defined in the industry). Information is also available from which to compile an age distribution of total spindles and active spindles in the mill. Likewise a distribution by kind of spindle can be made for total spindles and active spindles and these can be further classified by age, if desired. Thus we see the amount of information that can be taken from a carefully prepared questionnaire such as this one. All of the tabulation forms were drawn up at the same time the questionnaire was prepared and the two were made to conform at every point.

Definitions. In preparing a schedule or questionnaire, any word, phrase, or technical term which may lead to variation of interpretation should be defined.
The units in which the data are to be collected must also be defined. These definitions are equally important in the case of schedules and questionnaires, and the necessity for preliminary decisions regarding the terms to be used is the same in either case. However, it is evident from the rules regarding questionnaire construction that if a great amount of detailed definition proves necessary the use of the mail questionnaire is to be avoided. Whatever definitions are essential in a questionnaire should be printed as close as possible to the questions to which they apply. The definitions usually do not appear on a schedule but should be printed along with the general instructions to the agents.

[FIGURE 4: the spindle questionnaire form discussed above. The form is illegible in this copy; the only recoverable instruction reads "Check kind of spindle below."]

Terms: For either method of investigation, the definitions must be inclusive of all the limitations that have been placed upon the collection process. They must be so precisely worded that (1) no ambiguity of terms exists; (2) no limitations of terms are left indefinite; and (3) no technical uses of terms are unexplained. Some examples will show the necessity for careful wording and definition.

The treasurer of a department store submitted a list of questions to the department heads of the store. One of them read, "Have you been successful recently with promotions?" There may be no doubt as to what is wanted here, but the simpler thing would be to specify sales promotions rather than promotions of the staff.

Referring to the questionnaire sent to state hospital superintendents (p. 98), question 6 reads, "To what extent does overcrowding express itself in unsuitable sleeping quarters?"
The phrase "to what extent" is indefinite. Such words as "unsuitable," "adequate," and "sufficient" as used in this questionnaire are meaningless unless related to definite standards.

A questionnaire sent to college and university teachers contained this question, "What per cent of your regular salary goes for rent?" This question appears to be simple but the word "regular" is used in a technical sense. Presumably it means the compensation for teaching the usual number of hours per week for a nine months' period. This definition would exclude evening and extension school salary in some cases and include it in others. Summer school salary would not be considered "regular" even for a person who taught every summer. Many other variations exist in different colleges. The word "regular" requires exact definition if the results obtained from the questionnaire are to be comparable.

Units: Every kind of recording of numerical information requires that a unit be established in which to perform the process of enumeration. The unit may be a person, an animal, an inanimate object such as a tree or a house, a measured quantity such as a ton or a bushel, a money measure such as the dollar or the franc, an abstract concept such as an order, an accident, or a vacation. In some cases the type of enumeration to be made immediately determines the unit to be used, as in counting population or recording sales. In other cases a choice of units is available as in recording production of cement in which the count could be made in tons, barrels, or dollars of value. In every case in which the unit is not obvious from the nature of the enumeration, selection of a unit must precede the counting process.

Once the unit has been selected consideration must be given to the question of whether any uncertainty may arise in its use.
In many cases definition will be required to avoid ambiguity; thus in collecting information concerning the size of houses careful definition of what to count as rooms is necessary. Similarly, in recording industrial accidents, a careful statement must be made of what sorts of injuries are to be included as accidents.

Units can be divided into two kinds: (1) those with definition established by law or custom, and (2) those for which the definition must be established separately wherever they are used. Examples of the first kind are the bushel, the gallon, the yard, the hour, etc. Each of these measures carries a standard definition which serves as an adequate description any time it is employed as a unit. The unit for measuring wheat is the bushel, and no further definition is required. On the other hand when the unit used is a ship, a room, a voter, or a horse it is necessary to explain what shall be counted and what shall be omitted during the enumeration. Thus a room in a dwelling is not a usable unit until many borderline cases such as closets, breakfast nooks, pantries, and sun-rooms have been either included or excluded by definition. If a subsequent investigation is made using a different definition of a room, the results of the two investigations cannot be compared although both use the unit "room" as the basis for counting.

These two types of units are usually known as measurement units (fixed definition) and counting units (variable definition). To call the latter type counting units seems somewhat ambiguous because the process of enumeration involves counting regardless of the type of unit used. For that reason we prefer the distinction based on the amount of definition required. Variable definition units such as a "room" exist independently of the counting process, a fact which explains the need for definition each time they are used.
As separately existing entities they possess individual differences which inevitably run to limits where it is necessary finally to establish a boundary. The same situation does not arise when a unit such as the pound or mile is used.

Definition is required to different degrees in the realm of variable definition units. A "person" needs little or no definition because the unit is so universally recognized. Likewise the unit "citizen," a part of the universe of persons, has a fixed legal definition in each country. On the other hand such a unit as a "salesman" or a "criminal," each a part of the universe of persons, requires very careful definition. Only a few units are so well recognized as the person or citizen, hence the general conclusion that one must always be prepared to state variable definitions completely.

An example will illustrate what is likely to occur in particular cases. The instructions accompanying a schedule included this statement: "Information to be secured for wage-earners only." Trouble arose continually in determining exactly who were wage-earners. The general concept "wage-earners" was perfectly clear, but the definition of the statistical unit, a "wage-earner," was extremely difficult. On the face of it a wage-earner should be one who receives compensation from others for services rendered. But questions such as the following were brought in by the agents daily: "How about a physician who received a salary instead of fees?" "How about a daughter living at home and working for her father at a nominal salary?" "How about an insurance agent working on commission and receiving a fixed percentage of the annual profit?" As soon as these questions were answered, others arose. The answers given to such questions as these depend upon the purpose of the investigation. But no matter how carefully any such unit as "wage-earner" is defined borderline cases will arise which will have to be settled arbitrarily by the director.
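The earlier remark that two investigations using different definitions of a "room" cannot be compared is easily shown by counting the same dwelling twice. In the sketch below the dwelling, the two definitions, and the function `count_rooms` are all hypothetical; the borderline cases are the ones named in the text (closets, breakfast nooks, pantries, and sun-rooms).

```python
# A hypothetical dwelling, listed space by space.
SPACES = [
    "living room", "kitchen", "bedroom", "bedroom",
    "breakfast nook", "pantry", "sun-room", "closet",
]

# Two hypothetical definitions, each stated as the spaces excluded from the count.
DEFINITION_A = {"closet", "pantry"}
DEFINITION_B = {"closet", "pantry", "breakfast nook", "sun-room"}


def count_rooms(spaces, excluded):
    """Count the spaces that qualify as rooms under the given definition."""
    return sum(1 for space in spaces if space not in excluded)


rooms_a = count_rooms(SPACES, DEFINITION_A)  # 6 rooms
rooms_b = count_rooms(SPACES, DEFINITION_B)  # 4 rooms
```

The same house is reported as six rooms under one definition and four under the other, although both investigations would describe their unit as the "room." Writing the definition down as an explicit list of inclusions and exclusions is what makes the two counts reconcilable afterward.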
Form. General: The primary consideration that determines the form of a schedule is convenience, whereas the first requisite of a mail questionnaire is good appearance. In either case the size, shape, material, and type of printing are important. Whenever possible cardboard rather than ordinary paper is to be preferred. Cards are easier to handle in an interview and more durable for editing and tabulating. If cards are not feasible then a good quality paper should be used. One sheet of questions is always preferable. However, if the choice is between overcrowding of questions and the use of an additional sheet, the latter is the lesser of two evils. Overcrowded questions are likely to result in misplaced answers, incomplete answers, and increased difficulty of tabulation.

The first impression made by a mail questionnaire will determine to a large extent whether it will be answered or discarded. A closely printed sheet or card immediately gives the impression of being lengthy and time-consuming. This can be partially overcome by well-spaced questions, the use of rulings, and variations in type sizes.

Sequence of questions: The questions should be arranged so that they will form a natural sequence for the respondent. Although the sequence must be varied to meet the requirements of different investigations and will not be identical for schedules and mail questionnaires some general principles can be stated.

1. The initial question or questions should be simple.
2. Any preliminary questions which are necessary to pave the way for the key questions should come next.
3. The key questions, that is, those which relate to the major purpose of the investigation, should be placed at the end or near the end.
4. If one or more questions calling for an opinion rather than a statement of fact are included in the schedule, they are usually placed at the end.
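The four principles above amount to an ordering rule that can be applied to a draft list of questions. The following is a minimal sketch, assuming each draft question has already been labeled by the investigator as simple, preliminary, key, or opinion; the labels, the mapping `ROLE_ORDER`, and the function `arrange` are hypothetical conveniences and not a procedure given by the authors.

```python
# Position of each kind of question in the sequence, following principles 1 to 4.
ROLE_ORDER = {"simple": 0, "preliminary": 1, "key": 2, "opinion": 3}


def arrange(questions):
    """Order (text, role) pairs: simple, preliminary, key, then opinion.

    Python's sort is stable, so questions sharing a role keep the
    order in which the investigator drafted them.
    """
    return sorted(questions, key=lambda q: ROLE_ORDER[q[1]])
```

Such a rule gives only a first arrangement. As the text notes, the sequence must still be varied to meet the requirements of the particular investigation.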
Figure 5, taken from the questionnaire used early in 1935 in a market survey of college men, employs a good sequence with one exception.

FIGURE 5
RADIO SECTION OF QUESTIONNAIRE USED IN SURVEYING THE COLLEGE MARKET

1. Do you have a radio in your college room? Yes / No
2. What make?
3. When bought?
4. How many tubes?
5. Where bought? College town / Outside college town
6. Do you intend to purchase a radio before 1936? Yes / No
7. If so, what make?
8. About what price?

Questions 1, 2, and 4 provide the facts concerning the student's present radio. The questions are simple and the information is immediately available. Questions 3 and 4 should exchange places. Questions 3 and 5 give preliminary information leading up to question 6. Question 3 was presumably intended to give information concerning the age of the radio. The answer will, of course, be misleading if the student bought the radio second hand. Question 6 is the key question of the schedule. It gives the information concerning the potential market for radios among college men during the current year. Questions 7 and 8 correctly follow question 6 since they ask for information supplementary to it.

Auxiliary material: In addition to the list of questions certain explanatory material must be prepared. Most important is a letter of transmittal to accompany a questionnaire or instructions to agents who collect schedules. The purpose of the letter of transmittal is to engage the attention of the addressee and encourage him to respond. The instructions to agents provide a background of information which will permit them to accomplish the same result by personal interview.

Letter of Transmittal with a Questionnaire

Motives: In a mail questionnaire the questions themselves may be preceded on the same sheet by a brief explanation of the purpose of the investigation and the reasons for requesting information from the particular persons to whom the questionnaire is sent.
The usual method, however, is to inclose a letter of transmittal explaining the questionnaire and pointing out some incentive for answering. When sent to business men it is also customary to include a duplicate copy of the questionnaire for the respondent's own files.

Some of the motives to which an appeal for voluntary replies can be made are: (1) co-operation; (2) interest; (3) profit; (4) obligation; and (5) position. Figures 6-10 illustrate from actual examples how appeals can be made to different motives.

FIGURE 6
CO-OPERATION

DEAR SIR:

Will you be kind enough to take a few moments of your time to jot down the answers to the questions on the back of this letter? We would greatly appreciate this favor. You will not be asked to buy anything; your name will not be used at all.

This letter is one of a small number we are sending out at the request of a large manufacturer to get the viewpoint at first hand from some of his customers.

It is not necessary to write a letter. Just check or fill in your answers in the space provided and mail the sheet back to us in the enclosed stamped envelope.

Many thanks for your help.

Very truly yours,
[An Advertising Agency]

FIGURE 7
INTEREST

DEAR FELLOW-MEMBER:

The subject of calendar reform has been studied during 1933 by a committee of the American Statistical Association, and its majority and minority reports appear in the supplement to the Association's Journal for March 1934. Before taking action, the Board of Directors wished to pursue the subject further, and has appointed the present Committee, with instructions to ascertain the considered opinion of the Association's membership on the question of calendar reform.

The problem presented by the defects of the present calendar is obviously of importance to the statistical profession, a number of whose members deal with the analysis of time series.
It is the purpose of this Committee to obtain as far as possible the considered opinion of the whole membership. To aid in this undertaking, will you kindly fill out the questionnaire and return it to the Committee at the earliest possible date.

Yours very truly,
[A Committee Chairman of the American Statistical Association]

FIGURE 8
PROFIT

DEAR SIR:

In order to assist farmers in adjusting the production of milk and dairy products to prospective demand, the U. S. Department of Agriculture is now undertaking to collect more complete information regarding the number of cows being milked, the quantities of milk and cream being produced and sold and such information regarding the number of heifers being raised, the number of cows coming fresh, the quantity of grain being fed and dairymen's plans for the future as may be needed to find what changes in production may be expected. Those who cooperate with the Department by returning each month a report for the herd which they own or operate will receive copies of the reports.

On the other side of this page you will find some questions regarding the quantity of milk now being produced on your farm, the last price received, and the quantity of grain being fed. In return for your assistance I am enclosing a summary of the outlook for dairying so far as this Department has been able to determine the outlook from such information as is now available.

Yours very truly,
[A Division Chief of the United States Department of Agriculture]

FIGURE 9
OBLIGATION

DEAR SIR:

Those industries which today are making progress toward stabilization know their capacity as well as their demand. To bring some light to the capacity situation in worsted spinning, the Research Department reported briefly to the trade in December on post-war trends in worsted sales yarn spindles. Lack of figures at the time of that release prevented a detailed analysis of current worsted spindlage.
We are now ready for that step: an industry-wide inventory of all worsted spinning spindles in the textile mills of this country as of March 1st, 1931.

Leaders of your industry have approved this survey. Typical of their attitude is that expressed in a recent letter received from Mr. ____, a copy of which we are glad to submit at this time.

You are well aware, of course, that a survey of this nature is only successful to the degree that all firms in the industry respond. In a word the value of this survey to you is very directly tied up with the number of firms in the industry who supply the information outlined on the enclosed schedule. An early reply on your part will help insure an early report of the results by this Department.

Please do not hesitate to bring to our attention any problems that may arise in filling out this schedule. We will gladly assist you in any way possible.

Sincerely yours,
[A Director of a Research Bureau]

FIGURE 10
POSITION

FELLOW-ECONOMIST:

In order to help toward straight thinking on the prohibition question, will you please fill in the accompanying questionnaire to the best of your ability? Please note that many of the answers must represent opinions, but that your unbiased judgment as an economist is desired.

I assure you that your name will not be mentioned in any way unless you give permission.

Please do it now, using the enclosed stamped envelope for mailing your reply.

Sincerely yours,
[A University Professor]

Distinct from questionnaires which depend upon making an appeal to some voluntary motive are government requests for information to which answers are compulsory. The element of compulsion is predominant in such a letter, as Figure 11 will show.

FIGURE 11
COMPULSION

SST-1564-CLH
TREASURY DEPARTMENT
Office of the Collector of Internal Revenue
Pittsburgh, Pa.
July 29th 1937

B____ Company
M____, Pa.
The records of this office indicate that you filed an application on Form SS-4 for an "Employer's Identification Number" under the provisions of Title VIII of the Social Security Act and that you were assigned the Identification Number above indicated.
This Act requires every Employer to file a return for each month and pay the tax shown to be due, effective from January 1, 1937. The records of this office indicate that you have not complied with the law in this respect by reason of your failure to file a return for the months of January, February, March, April, May, and June, 1937, inclusive.
You are, therefore, requested to file a separate return on the blank forms inclosed for each month above mentioned and to forward the same to this office with your remittance for the tax due. An affidavit in explanation of your failure to file the returns within the time prescribed by law must accompany the returns for the consideration of the Bureau in connection with the assertion of the delinquency penalties. (Blank affidavit inclosed.)
In preparing these returns, your complete Name, Address and your Identification Number must be shown thereon as indicated at the top of this letter.
If, for any reason, you are not subject to the provisions of this Act, please advise this office fully, or, if a return was filed by you in another District, advise the date and the place where filed, also the serial number stamped upon your cancelled check.
Reply should be made within ten (10) days from the date of this notice.
Very truly yours,
[An official in the Internal Revenue Office]
Sometimes several motives for securing replies will be combined. The question of what type of appeal to use will depend upon the particular circumstances involved. Considerable care should be exercised in writing these letters because they must accomplish in an investigation by mail what an agent does by personally interviewing prospective informants when schedules are used.
Instructions to Agents: There are two parts of the instructions to agents: (1) the definitions of terms and units used in the schedule and (2) the general instructions. The definitions were discussed on page 100. An illustration of what should be included under general instructions is provided in Figure 12, which was prepared for a study of home ownership in Buffalo, New York.
FIGURE 12
INSTRUCTIONS TO COLLECTING AGENTS
PRELIMINARY
You have in your possession a letter identifying you as an agent of the President's Conference on Home Building and Home Ownership. Use this identification discreetly, remembering always that you must not use compulsion in seeking replies; on the other hand, you do have back of you the authority of a government investigation which should instil confidence and insure to the informant that all information given will be treated as absolutely confidential. In the use and publication of results no individual information will ever be divulged.
No identification appears on the schedule except the case number. On the separate sheet which is provided for the purpose keep an accurate record of the exact address from which each case is taken. Agents must be doubly careful to keep this special record accurate or the editing of the questionnaires may be seriously handicapped.
This study is confined to families having total income (earned or other) not exceeding $3000. You will be unable to determine exact income before reaching page 3 of Form 1. A preliminary question, however, should determine the availability of the family. If exact tabulation of page 3 should show total income slightly in excess of $3000, the schedule will be used.
Information is to be collected only from families composed of a minimum of husband, wife, and one dependent child. Where you find more than one family occupying quarters which are quite clearly intended for a single family, do not fill out Form 1 but secure the information on Form 2.
Information is to be collected only from families who were purchasing homes during 1930, but who began paying for their homes prior to January 1, 1930.
Information is to be collected only from families in which both parents are native born whites.
The most inexcusable errors in compiling data are those arising from carelessness on the part of collecting agents. Therefore,
(a) Write legibly; be neat.
(b) Be sure that you understand all questions and instructions.
(c) Before closing your interview, check to avoid omitting any part of your schedule.
Selection and Training of Staff
Types of Workers. The number and types of workers needed depend entirely upon the character of the investigation. If it is a questionnaire study, the problem of organizing a clerical staff for the preliminary work is in no way peculiar to a statistical inquiry. However, when agents are to be used the selection of personnel presents a more specialized problem. In both types of investigation, editors and a staff of statistical clerks will be needed as soon as collection is under way. The selection and training of all these workers becomes a part of the process of conducting an investigation, particularly if the only available staff consists of students or other inexperienced workers.
Qualifications and Training. It is essential not only that each of these workers receives instruction as to his own specific duties, but also that each acquires a thorough understanding of the entire process. For example, if the agents receive some training in the method of tabulation, they will realize why they are asked to make entries in a uniform manner. If the editors have even a slight experience in collecting the schedules, they understand the difficulty of getting exact information and will be more likely to offer their criticisms to the agents without arousing antagonism.
Testing of the various staff members in different parts of the work has the additional advantage of discovering which ones are best adapted for editing and which ones make the best agents. Some individuals may not be successful in making personal contacts and getting information from other people but may have an eye for detail and a capacity for detecting errors. The latter qualities are desirable in an editor, but even more important is the ability to appraise a schedule as a whole for consistency and validity. This is especially true when long and complicated schedules, as in a cost-of-living inquiry, are being collected. A schedule may balance perfectly, having no specific errors of any kind, and yet contain gross inconsistencies or omissions that can be detected by an editor with common sense. For example, in a cost-of-living investigation good editing would immediately question a schedule in which some income was reported from insurance benefits resulting from the death of a member of the immediate family, but where no item for funeral expenses appeared under "expenditure."
Agents become the direct representatives of the organization conducting the inquiry in making contacts with those from whom the data are to be obtained. Upon their shoulders rests to a considerable degree the responsibility for the success of the undertaking. A good agent must have "tact," defined as "intuitive perception, a ready appreciation of the proper thing to say or do, especially a fine sense of how to avoid giving offense." In other words he must be a good salesman in order to "sell" to the informant the idea of answering the questions.
Training in certain techniques must be added to this natural capacity before the agent becomes a qualified field representative. First, he must thoroughly understand the general purpose of the investigation and believe in it himself.
He should be able to explain it and convince the informant of its validity without the necessity of referring to his letter of credentials. Agents must always be furnished with such credentials for their own protection, but an official letter practically never has a persuasive effect on an irate informant even though the investigation is being conducted under government authorization.
Next, the agent must be so familiar with the schedule and all of the instructions, definitions, and limitations that he can conduct the interview and complete the schedule without any hesitation or reference to notes. He is never permitted to alter the meaning of a question or definition, and if in doubt on any point he should add full notes describing the situation. If an unusual situation not contemplated by those who planned the investigation should arise, a well-trained agent should be prepared for such contingencies. His function is to secure complete information on the unusual case so that final disposal can be made by the person in charge of the investigation. In such cases it may be advisable not to complete the interview, but to leave the way open for a return visit after having consulted the director for advice.
All of the information which the agent secures, whether written on the schedules or given orally in an interview, is completely confidential. Inexperienced agents sometimes forget that they are not at liberty to discuss collected information with anyone, even with fellow-agents and much less with friends. Failure to observe this rule can be extremely embarrassing, if information that was given in confidence comes back to its origin through third parties, and may have even more serious consequences.
Example of Agent Technique. The following example shows the effect of proper and improper agent technique: Several years ago an investigation of housing conditions was being made in a large city.
Although the investigation was sponsored by a committee which had been appointed by the President of the United States, it was not official nor was anyone compelled to give information to the agents. In spite of the fact that the status of the investigation had been explained fully, one of the agents insisted upon answers to his questions to a point which caused an irate housewife to call the police station. The agent was picked up by a policeman on the housewife's complaint, taken to the station house, and subsequently to the office of the police commissioner. At this point the director was called to the commissioner's office. The director's explanation convinced the commissioner that no crime had been committed, but did not convince him that agents should be permitted to annoy housewives any further.
Fortuitous circumstances entered the case at this point. A detective who had been assigned to the case was called in by the commissioner. Quite unexpectedly the detective reported that another agent had appeared at the door of his home the previous evening, that the agent had been courteous, his questions inoffensive, and that the detective had been entirely willing to give the information requested. The commissioner then consented to have the investigation continue provided the offending agent were dismissed.
Perhaps it is superfluous even to point out the value of the second agent's work. By the proper approach this man had secured the information that he wanted, had created good-will for himself, and quite unwittingly had saved the entire investigation. In addition to showing how agents may succeed or fail, the example points to the desirability of notifying police authorities before sending agents out to do house-to-house investigation.
Supervising the Work
The foregoing example indicates the necessity for constant supervision on the part of the director while the investigation is in progress.
He must be available at all times so that any unusual situations can be met as they arise. On the other hand every detail of the routine plan should be a matter of written record, and each part of it should be thoroughly understood by at least one other staff member, so that the orderly progress of each step is automatic regardless of the presence or absence of any one person. These routine steps include: the checking of the adequacy of the sample as the collection proceeds; a regular system for making assignments, accounting for returns and routing of schedules to agents, to editors, back to agents if necessary, and finally to tabulator; arrangement for check interviews by visit or telephone as a test of each agent's ability and integrity; adherence to the quota and time schedule originally planned for the investigation; issuance of additional or revised instructions to the entire staff whenever necessary; and provision for staff meetings at regular intervals for the discussion of difficult points that may arise.
SUMMARY
In the conduct of any specific investigation numerous details arise that cannot be discussed in a general textbook. No attempt has been made in this chapter to furnish a complete guide for a person undertaking a statistical investigation in any particular field. Many books¹ devoted solely to the description of research techniques can be consulted to supplement the statement of principles and methods presented here.
PROBLEMS
1. Assuming that information on the following subjects is to be obtained by direct collection, which of the three types of sources listed in the text should be used in each case?
a) The brands of bread used by families in a city.
b) The distribution of employed persons in a city according to a classified list of occupations.
c) The extent of use of different types of anti-freeze solution in automobile radiators.
d) The extent of unemployment of union labor and non-union labor.
e) The tendency toward the construction of lower-cost houses in urban centers.
f) The distribution of vacant dwellings in a city according to rent level.
g) The effect on sales of milk in a community of an increase of 7 per cent in the retail price.
2. What is the purpose of preliminary testing before starting a direct investigation?
3. Explain the difference in wording of questions in a schedule and a questionnaire.
4. Explain which of the following alternative wordings is preferable for a questionnaire and why:
a) (1) What color do you prefer for your next automobile?
(2) Check the color you prefer for your next automobile:
____ green  ____ maroon  ____ blue  ____ brown  ____ grey  ____ gun metal  ____ black  ____ other (specify)
b) (1) Do you consider the advertising statements of local stores more reliable than statements found in magazine advertising? More ____ Less ____
Do you consider the statements made in advertising over the radio more reliable than those found in newspapers? Yes ____ No ____
Do you feel that a statement in an advertisement is more reliable than a statement by a clerk in the store? Yes ____ No ____
(2) Mark in the order you consider them dependable the following media of information concerning consumers' goods (mark the most dependable 1, etc.)
____ magazine advertising  ____ radio advertising  ____ newspaper advertising  ____ statements of clerks in stores
c) (1) Do any of the following apply to your concern? (Check which.)
____ too many salesmen  ____ sales management inefficient  ____ sales territory poorly allocated  ____ sales commissions too large
(2) Which of the following would be most effective in reducing selling expenses in your concern? (Check one.)
____ reduction of selling force  ____ reorganization of sales management  ____ reallocation of sales territory  ____ reduction of sales commissions
¹ A few of these are listed at the end of the chapter.
5. Define the following terms for use in a schedule or questionnaire.
Be sure to provide for possible borderline cases: a) a farm; b) a factory; c) an employed person; d) a department store; e) a radio news broadcast.
6. Explain the difference between fixed definition units and variable definition units. Give three examples of each not taken from the text. Explain the need for definition in each of your examples of a variable definition unit.
7. Write a letter to accompany the radio questionnaire of Figure 5, page 105.
8. The following questionnaire was sent to the subscribers of a magazine by the management of the magazine. Write a letter to accompany the questionnaire.
Name ____  Age ____
Address ____  City ____
Occupation ____  Name of Company ____  Position ____
Are you the head of a family? ____  Number in family ____
Do you own an automobile? ____  Make ____  Year ____
Do you own your home? ____  Number of Rooms ____
What are your hobbies? ____
Where do you spend your vacations? ____
Do you own a radio? ____  Make ____
Have you a telephone? ____
Suggestions ____
9. Summarize the qualifications of a collecting agent.
REFERENCES
BOWLEY, ARTHUR L., Elements of Statistics. London: P. S. King and Son, Ltd., 1920 (fourth edition). Chapter III presents rules for the preparation of schedules and several examples of direct collection.
BROWN, LYNDON O., Market Research and Analysis. New York: The Ronald Press Co., 1937. Chapter 9 contains an excellent statement of the rules to be observed in preparing a questionnaire.
DAY, EDMUND E., Statistical Analysis. New York: The Macmillan Co., 1925. Statistical units are discussed on pages 17-23.
EIGELBERNER, J., The Investigation of Business Problems. New York and Chicago: A. W. Shaw Co., 1926. Chapters IX, X, and XII deal with the mechanics of direct collection.
SAUNDERS, ALTA G., and ANDERSON, CHESTER R., Business Reports. New York: McGraw-Hill Book Co., Inc., 1929. Chapters VIII and IX deal with the collection of data; emphasis on the preparation of questionnaires.
Standards of Research. Des Moines, Iowa: Meredith Publishing Co., 1929. Pages 20-26 contain an outline statement of the "Planning of Questionnaires." The use of agents is outlined on pages 40-43.
The Technique of Marketing Research. Prepared for the American Marketing Society by the Committee on Marketing Research Technique. New York: McGraw-Hill Book Co., Inc., 1937. This entire book should be required reading for every person who expects to engage in marketing research.
CHAPTER VII
EDITING AND PRELIMINARY TABULATION
Two important steps, editing schedules and preliminary tabulation, follow the collection of data and precede the preparation of tables for presenting the collected information. Inadequate attention has sometimes been given to these processes because they do not require the use of involved techniques. Nevertheless an understanding of the methods of editing schedules and of transferring information from schedules to the initial tabular form is an essential part of the conduct of an investigation. The two processes are distinct, hence this chapter has been divided into two parts.
EDITING SCHEDULES
As agents' schedules and mail questionnaires are returned they must be studied very carefully in order to detect any irregularities in the responses. Experience demonstrates that this step is necessary whether the collection has been made by agents or by mail, although more questions will be answered incorrectly in mail questionnaires than in schedules collected by agents. Before any analysis is undertaken these errors must be detected by an editor and corrected if possible. The editor performs two functions: (1) detecting irregularities in the replies and (2) preparing the schedules¹ for tabulation.
Editing for Irregularities
There is no fixed order in which the editing should proceed. That is within the discretion of the editor.
The following order will serve in many cases: (1) look for omissions, (2) verify check questions, (3) check for inconsistencies, (4) search for errors, and (5) check for uniformity between schedules.
Look for Omissions. Each schedule should be complete. If the answers to any questions are missing an attempt should be made to get the information either by mail or by a second interview with the informant. Failure to obtain the information by these means may cause the editor to mark that part of the schedule "no report" or, if the missing information is primary, to discard the schedule. In the chain-store inquiry cited in the chapter on sampling (pp. 79-81), 195 schedules were discarded entirely because primary information was missing and 50 other schedules were incomplete in some respect.
¹ As used in this chapter the word "schedules" refers to collected information whether obtained by agents or through the use of mail questionnaires.
Verify Check Questions. If the collection form includes answers to questions which should verify or check each other and these fail to check, the editor must search for collateral information that will indicate which of the responses is in error. For example, the age of a house may be stated as 22 years (in 1936), the date of construction as 1920, and the initial mortgagee as a bank which was liquidated in 1916. The date of construction has apparently been given incorrectly, but the editor must not guess about this. If no collateral verification is possible, either the schedule must be returned to its author or the answers to these questions must be discarded.
Check for Inconsistencies. There will often be questions the answers to which can occur only in certain combinations or in certain sequences. The editor must test these combinations and sequences for consistency.
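The first three editing steps can be pictured as mechanical tests applied to each schedule. The following is only a sketch built on the house example above; the record layout, the field names, and the one-year tolerance are assumptions made for illustration and are not taken from any schedule in the text. In keeping with the rule that the editor must not guess, the function reports irregularities and leaves the decision about which answer is wrong to the editor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HouseSchedule:
    """One returned schedule. All field names are hypothetical."""
    survey_year: int
    age_of_house: Optional[int]      # age in years, as reported
    year_built: Optional[int]        # date of construction, as reported
    mortgagee_closed: Optional[int]  # year the initial mortgagee was liquidated, if known


def edit_schedule(s: HouseSchedule, tolerance: int = 1) -> List[str]:
    """Return the irregularities an editor should follow up (empty list = none found)."""
    queries: List[str] = []

    # Look for omissions.
    for name in ("age_of_house", "year_built"):
        if getattr(s, name) is None:
            queries.append(f"omission: {name}")

    # Verify check questions: reported age and date of construction should agree.
    if s.age_of_house is not None and s.year_built is not None:
        implied_year = s.survey_year - s.age_of_house
        if abs(implied_year - s.year_built) > tolerance:
            queries.append("check questions disagree: age vs. year built")

    # Check for inconsistencies: the initial mortgagee cannot have been
    # liquidated before the house was built.
    if s.year_built is not None and s.mortgagee_closed is not None:
        if s.mortgagee_closed < s.year_built:
            queries.append("inconsistent: mortgagee liquidated before year built")

    return queries


# The example from the text: age 22 in 1936, built 1920, bank liquidated 1916.
for query in edit_schedule(HouseSchedule(1936, 22, 1920, 1916)):
    print(query)
```

For the schedule in the example both the check-question test and the consistency test raise a query, which is the signal to seek collateral information or return the schedule to its author.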
Replies to the following two questions sent in by a gasoline and oil service station are inconsistent:
What disposal do you make of bulk motor oil distributed to you on quota and unsold? (check which)
Throw away ____
Sell at lower price ____
Return to agency for credit  ✓
Mix with new oil received ____
Allow to accumulate for waste use ____
At what price per quart do you sell motor oil?
Heavy body  31 cents
Medium body  30 cents
Light body  ?
Old stock left over  20 cents (known in the trade as last year's oil)
If last year's oil which is unsold is returned to the selling agency for credit, then the 20-cent selling price quoted has no meaning. Either some old oil is sold or the price quotation on it should be removed. The editor must find out which answer is to be amended.
Search for Errors. Any calculations which are on the schedule should be carefully checked. A tabulation of a total and its parts should also be verified. Errors which occur in numerical relations can usually be corrected by the editor. There are some cases, however, in which errors of this kind will require a resubmission of the schedule to the maker.
Beyond these obvious things there may be others which can be detected only by a careful study of the answers to all of the questions. For example, a research bureau was receiving monthly reports of sales from a number of department stores. The sales of one store seemed to move opposite to the others in June. After this had happened for the third year, the director of the bureau grew suspicious. The difference might be the result of a special sale in June, or be due to the handling of special seasonal merchandise, but preliminary study of the case failed to disclose any reasonable cause for the difference.
Finally a direct appeal to the co-operating store disclosed the fact that the controller, who regularly made out the monthly report, took his vacation in July, and his substitute, who made out the report of June sales, misunderstood the schedule and reversed the May and June figures.
Check for Uniformity between Schedules. The editor should check for uniform interpretation of all of the questions. He is quite likely to find that one or several questions have been misconstrued on some of the schedules. These things may not be evident in studying the schedules individually but may appear when one question is studied on all of the schedules. In an investigation of moving picture attendance by students, this question was asked: "How much did you spend for moving picture admission last week?" One of the student agents was noted for his erratic behavior and on this question he ran true to form. Many of his schedules showed an expenditure well above that of other schedules. Inquiry elicited from him the fact that he had probably asked for expenditures during the past month. All of his schedules were dropped from the investigation.
Re-editing
If the investigation is a large one or the schedule form is complex, two editors should go over the schedules independently. The second editor will find things which the first one overlooked. In fact the schedules will always be somewhat less than perfect. As much time and money as possible should be used to improve them. Sometimes it will be better to have different editors check different parts of the schedule. This plan has the advantage of concentration but is weak in that no one person gets a comprehensive view of the schedules as a whole.
The work of each editor should be distinguishable either by the use of different colored inks or other distinctive marking so that any corrections or alterations by an editor can be referred back to their author if necessary.
It should be a fixed rule that editors do not erase; they cross out and substitute in all cases.
Preparing the Schedules for Tabulation
After the various irregularities have been adjusted, some steps still remain to be taken before the schedules are ready for tabulation. In the course of these adjustments changes will have been made on some schedules, unusual markings will appear on others. The editor should indicate specifically how such items are to be tabulated if there is any chance of subsequent misunderstanding. To facilitate the transfer of information from editor to tabulator, all final corrections should be made in ink of a certain color.
Finally, the editor should indicate the proper classifications of items if they are to be tabulated in a form different from the way they appear on the schedule. For instance, if the question "What is your occupation?" appears on the schedule and no check list accompanies it, the answers will appear in a variety of forms. The editor must mark these replies according to the occupational classification to be used in tabulation. Again, the schedules may show the state from which they come, whereas the tabulation is to be made by geographic areas. These areas should be marked by the editor. This sort of editing adds to the speed and accuracy of the tabulation and makes it unnecessary to employ a highly skilled staff in the tabulation process.
Sometimes the process of preparing the schedules for tabulation involves an intermediate step known as coding. The best example of this occurs when mechanical tabulation is employed. Coding of information to be transferred to "punch" cards becomes one of the most important steps in the mechanical process which is described in the second part of this chapter.
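The marking of occupations and geographic areas described above amounts to looking each reply up in an agreed classification. A minimal sketch follows. The occupational groups are invented for illustration, and the geographic areas shown are a small sample of Census divisions chosen as an assumption; neither list comes from the text. A reply that is not found in the classification is referred back to the editor and is never assigned by guess.

```python
# Hypothetical classifications; an actual investigation would fix these in advance.
AREA_BY_STATE = {
    "New York": "Middle Atlantic",
    "Pennsylvania": "Middle Atlantic",
    "Ohio": "East North Central",
    "Illinois": "East North Central",
    "Iowa": "West North Central",
}

OCCUPATION_GROUP = {
    "bookkeeper": "Clerical",
    "stenographer": "Clerical",
    "machinist": "Skilled labor",
    "salesman": "Sales",
}

REFER = "REFER TO EDITOR"


def code_schedule(state: str, occupation: str) -> tuple:
    """Return (geographic area, occupational group) for one schedule."""
    area = AREA_BY_STATE.get(state.strip(), REFER)
    # Free-form answers vary in spelling; normalize case and hyphenation only.
    key = occupation.strip().lower().replace("-", "")
    group = OCCUPATION_GROUP.get(key, REFER)
    return area, group


print(code_schedule(" Ohio", "Book-keeper"))
print(code_schedule("Texas", "farmer"))
```

Keeping the classification in one table, rather than leaving the choice to each tabulator, is what allows the tabulation itself to be done by a less highly skilled staff.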
PRELIMINARY TABULATION
The methods to be used in transferring information from the collection forms to preliminary tables will depend upon the size of the investigation, the character of the data, and the ultimate form in which the results are wanted. The following four methods are available: (1) sorting-counting, (2) the use of a tally sheet, (3) the use of a work sheet, and (4) mechanical tabulation.
Sorting-Counting
The sorting-counting process can be used to advantage when the data to be tabulated are relatively simple so that each case can be put on one small card. The cards can be sorted and sub-sorted into piles according to any desired plan of classifying the data. The number of cards in each pile can then be counted and recorded on a tabular form prepared for the purpose.
The card used in an investigation of vacant dwellings in Buffalo, New York, has been reproduced as Figure 13. The purpose of the investigation was to discover the amount of vacancy in different sections of the city, the extent of vacancy in single and multiple family dwellings, and whether more or less vacancy existed in buildings designed jointly for business and dwelling occupancy. The sampling method was used, every sixth census enumeration district being canvassed. About 24,000 of the 140,000 families in the city were included in the study. A card was filled out for each dwelling place; hence the number of cards equalled the total number of places which either were or could have been occupied by a family. Thus for a three-family house three cards would have been turned in with the same address, each card recording the status of a single flat in the building.
The cards were kept separate by enumeration districts. After they had been edited and the number of cards from each district recorded on a master sheet, the cards were ready to sort. They were first distributed in five piles according to the number of dwelling places in the building.
Each pile was then sorted into residential or combination residential and business. Each of these ten piles was sorted according to whether the dwelling place was occupied or vacant. The number of cards in each of the twenty piles was then counted and the results entered in the proper row of Table 12.
FIGURE 13
COLLECTION CARD USED IN RESIDENTIAL VACANCY INVESTIGATION IN BUFFALO, NEW YORK
Serial No. ____
Address ____
Ward ____  Tract ____  Enumeration District ____
No. of Dwelling Places in Building: One ____  Two ____  Three ____  Four ____  Over Four (give number) ____
Occupied ____  Vacant ____
Residential ____  Combination ____
Agent ____
The cards were then collected into a single pack, shuffled, and turned over to another tabulator to be sorted, counted, and entered independently on a duplicate of Table 12. The cards were then held in distributed form until a third person checked the two records together and checked the total of the row with the number of dwelling places in that district as shown on the master sheet. If the two records agreed, and the totals checked with the original count, they were considered to be correct. If not, the cards were recounted until the discrepancy came to light. A similar procedure was followed in each enumeration district. As indicated in Table 12, the results were subtotaled by tracts, if there was more than one enumeration district in a tract, and in every case by wards. This plan made it possible to use whichever of the geographic subdivisions was desired in subsequent work.
[TABLE 12. One row for each enumeration district, with subtotals by tract and ward. Columns: dwelling places in buildings accommodating one, two, three, four, and over four families, each divided into residential and combination, occupied and vacant. The entries are not legible in this copy.]
Tally Sheet
The use of a tally sheet is the reverse of sorting-counting in that the schedule cards or sheets are not separated into piles according to the various classifications. Instead a blank form, or several of them in a complex investigation, is made up to conform to the classifications of the data. The information is then scored on the blank form as it is read from the collection form. One person should read from the schedules while another person records on the tally sheet. There is an advantage in having several persons record the information simultaneously on separate sheets in order to secure one or more checks from the same reading. This, however, provides no check on the accuracy of the reading, so that perhaps the safest procedure is two independent readings and recordings.
The weakness of the method lies in the fact that an error can be corrected only by rereading all the schedules. A device which partially overcomes this weakness is to divide the schedules into piles and then subdivide the tally sheet into corresponding parts. An error can then be localized in one of the piles and the rereading confined to that one.
If this method were to be used in tabulating the information of the vacancy survey in Buffalo, the tally sheet would have the form shown in Figure 14.
One person would read off first the number of dwelling places in the building, then whether the building was residential or combined business and residential, and then whether occupied or vacant. The person doing the tallying would locate the proper block as the information was read and would register one stroke for each dwelling place. The subtotals by enumeration districts, tracts, and wards would be recorded as indicated on the tally sheet.

If there is too much cross-classification of the data, obviously this method becomes cumbersome. In that case it is probably better to abandon the tally sheet and use sorting-counting or the work-sheet method explained in the next section. The tallying process may be simplified, however, by sorting the schedules first into their major classifications and then tallying by subgroups. In Figure 14 the cards would be separated according to enumeration districts before the tallying was done.

The tally-sheet method is often the most desirable to use in taking information from a published source. For example, if one wished to record the number of industries in an area according to number of employees, the best method would be to make up a tally sheet with a classification by number of employees and tally the industries as they were read from the Census of Manufactures.

The Work Sheet

The purpose of a work sheet is to bring the information together in more convenient form than the schedules, so that it will be ready for further tabulation and analysis.
After having been edited, the answers recorded on all the schedules are transferred to a single sheet or several sheets, depending on the size of the investigation. The headings of these sheets will correspond to the questions on the collection schedules. Thus any tabulation which can be obtained from the original collection forms can be taken equally well from the work sheet.

FIGURE 14
TALLY SHEET FOR RECORDING RESIDENTIAL VACANCY IN BUFFALO, NEW YORK
(The columns of the sheet are headed by ward, tract, and enumeration district. The rows are one-, two-, three-, four-, and over-four-family buildings, each divided into residential (R) and combination (C), and these again into occupied (O) and vacant (V). Strokes are entered in groups of five with the count written beside them. The tally marks are not legible in this copy.)

In chapter VI, page 105, a questionnaire was presented concerning the college market for radios. Figure 15 is a proposed work sheet for recording this information. One row of the work sheet is used for each questionnaire; thus the identity of the information is preserved.

At least one sorting of the schedules can be made prior to recording any of the information on work sheets. In this case, the first obvious question on which to sort is whether the student now owns a radio. Second, from the non-owners, those who have not expressed the intention of purchasing a radio during 1936 may be eliminated entirely.
There is no information that needs to be tabulated regarding this latter group, except a count of the total number of such cases. The complete form of work sheet shown in Figure 15 is needed only for the schedules of present radio owners. The non-owners who expect to buy can be recorded on a much shorter form that includes only the last three sections, which deal with expected purchases in 1936. This procedure not only saves time in recording and space on the work sheets, but simplifies the task of classifying the information that will later be taken from these work sheets.

Each section on the work sheet represents one question on the schedule and must include enough separate columns to accommodate every possible reply that is expected to that question. In transferring the information from the schedule, a check mark or other equivalent symbol is made in one column of each section. Thus the number of cases in any column can be totaled easily by counting the check marks in that column. The column subtotals in every section should add to the same result, which should be the total number of schedules being tabulated.

In planning the headings of the sheet, it must be anticipated that for practically every question there are likely to be unexpected replies or unusual cases requiring notes of explanation, as well as instances where no answer appears. It is necessary therefore to provide extra columns in nearly every section, such as those in Figure 15 marked "Notes," "Other" and "Don't Know."
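The check that every section of the work sheet adds to the number of schedules can be sketched in a few lines. The section names, column headings, and rows below are hypothetical stand-ins and are not the headings of Figure 15.

```python
# Hypothetical sections of a work sheet and the columns provided in each.
SECTIONS = {
    "make": ["A", "B", "Other", "Don't Know"],
    "how_acquired": ["Bought", "Built", "Don't Know"],
}

def column_totals(rows, sections):
    """Count the check marks in every column of every section."""
    totals = {name: dict.fromkeys(cols, 0) for name, cols in sections.items()}
    for row in rows:
        for name in sections:
            mark = row.get(name)  # None when the section was left blank
            if mark is not None:
                totals[name][mark] += 1
    return totals

def sections_out_of_balance(totals, n_schedules):
    """List the sections whose column subtotals do not add to n_schedules."""
    return [
        name
        for name, cols in totals.items()
        if sum(cols.values()) != n_schedules
    ]

# One row per schedule; the third row has no mark in the "make" section.
rows = [
    {"make": "A", "how_acquired": "Bought"},
    {"make": "Don't Know", "how_acquired": "Built"},
    {"how_acquired": "Bought"},
]
totals = column_totals(rows, SECTIONS)
print(sections_out_of_balance(totals, len(rows)))
```

A section that fails the check has a schedule with no mark, or with more than one, and only that section needs to be gone over again.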
By checking "Don't Know," a count of such cases can be included in the check total for that section. All written notes are confined to "Notes" and "Other," and do not interfere with the count of check marks.

FIGURE 15
PROPOSED WORK SHEET FOR QUESTIONNAIRE USED IN COLLEGE MARKET INVESTIGATION, FIGURE 5, PAGE 105
(Body of the work sheet not legible in this copy. The legible headings include the radio section and the date bought or built.)

The simple process of totaling single columns becomes more complicated when cross-relationships are wanted. For example, if the make of the present radio were to be related to the make of the radio the student expected to buy during 1936, a two-way table would be prepared and the tallying method used. Only those who answered "yes" to the question, "Will you purchase a radio before 1936?" would be included. The completed work for 100 cases might appear as shown in Table 13. The facts could be read in the form shown, or as a final table by substituting figures for the tally marks.

TABLE 13
PROPOSED TALLY SHEET USING PART OF INFORMATION RECORDED IN WORK SHEET, FIGURE 15, HYPOTHETICAL DATA SHOWING RELATION BETWEEN MAKE OF RADIO COLLEGE STUDENTS OWN AND MAKE THEY EXPECT TO PURCHASE
(Rows: make of radio owned. Columns: make of radio the student expects to purchase. The tally marks in the body of the table are not legible in this copy; only the marginal totals are given here.)

Make              Owned (row total)    Expect to purchase (column total)
A                        11                        9
B                         9                       11
C                         8                        3
D                         5                       11
E                         8                       10
F                         6                        8
G                        10                        9
H                         6                        7
Other                    15                       11
Home made                 7                       ..
No information           15                       21
Total                   100                      100

After this has been done, it can readily be seen that the collected information is no longer in a preliminary state.
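A two-way tally of this kind can be sketched as follows. The pairs of answers are invented for illustration, and the function name is an assumption; only the idea of tallying each case in the cell where its row and column meet comes from the text.

```python
from collections import Counter

def two_way_table(pairs):
    """Tally (make owned, make expected) pairs into cells and margins."""
    pairs = list(pairs)
    cells = Counter(pairs)
    row_totals = Counter(owned for owned, _ in pairs)
    col_totals = Counter(expected for _, expected in pairs)
    return cells, row_totals, col_totals

# Hypothetical answers read from the work sheet, one pair per student.
answers = [
    ("A", "A"),
    ("A", "B"),
    ("B", "B"),
    ("Home made", "A"),
    ("No information", "No information"),
]
cells, row_totals, col_totals = two_way_table(answers)

# Both sets of marginal totals must add to the number of cases tallied.
assert sum(row_totals.values()) == sum(col_totals.values()) == len(answers)
print(cells[("A", "B")], row_totals["A"], col_totals["A"])
```

The agreement of the row and column totals with the number of cases is the same kind of check total used on the work sheet.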
Table 13 represents a selection and combination of certain parts of the original data and can be read as follows:

a) Fifteen per cent did not state the make of radio owned and 21 per cent did not state the make of radio they would purchase. It is difficult to draw any conclusions from the remainder of the table with so much information missing.
b) Radio "C" seemingly is in disrepute. None of the present owners would repurchase it and only three would turn to it from other makes.
c) While there is considerable evidence of shifts in consumer preference, the radio owned at present has some advantage over competing products except in the case of "C."
d) At least five of the students who built their present sets would not do so again.

Other combinations of data can be made from the information on the work sheet, using tally forms similar to that shown in Table 13. The purpose for which the information was gathered will determine what forms to use. Note, however, that the work sheet itself is in no sense a final form for the data but merely an intermediate device. It is seldom published even when a complete record of the statistical analysis is included along with the report of an investigation.

Mechanical Tabulation

When a large number of schedules is to be analyzed the task of tabulation becomes enormous. Likewise when a great amount of cross-tabulation is necessary, even though the number of cases is not so large, the task of preparing tables is likely to become the "bottleneck" of the investigation. Under either of these circumstances the present practice is to abandon hand tabulation in favor of the use of machinery designed for the purpose. At no other point is the statistician so much favored by the developments of the machine age as in tabulation. Equipment is available to perform quickly and accurately the steps of sorting, counting, cross-tabulating, and recording in columnar form.
These advantages have led a great many business concerns to install mechanical systems for the maintenance of records of current operations. The variety of uses made of the "punch card" system for this purpose is illustrated by the following examples: bad debt losses of members of a retail credit association; broker's record of security dealings with customers; merchandise control of a mail-order house; records of premium payments of a life insurance company; service record of employees of a shipbuilding company; deliveries to individual stores from the central warehouse of a chain grocery company; stock control in the warehouse of a chain grocery company.

Principles. The basic principle of machine tabulation is that a hole punched in a card represents by its horizontal and vertical position a certain statistical fact. It becomes a permanent record that can be used in tabulation at any time by running the card through a machine.

The first machine developed for this purpose was the "sorter." This machine will sort a pack of punched cards into numbered compartments according to any one set of information. Further refinement has led to an attachment which will count the number of cards going into each compartment; sorting-counting has thus been reduced to a single operation.

The next step was the invention of the "tabulator," a machine that operates at a more complex level. After the cards have been sorted, the tabulator can add the amounts recorded on each and furnish a printed record of the total. For example, if cards were punched showing the weekly wage rates of a firm's employees, each card representing one employee, the tabulating machine could be set so that it would give a printed record of the number employed at each wage rate, the total earnings of each group, the total number of employees, and the total weekly payroll with a single running of the cards.

Steps in the Process.
Probably the only way to understand fully what the machines can do is to see them in operation.² It will be worthwhile, however, to present in some detail those parts of the process which receive the least attention in a practical demonstration: (1) preparation of a code; (2) transfer of collected information to a code sheet; (3) punching the cards; (4) sorting-counting; (5) tabulation of numerical information; (6) cross-tabulation; (7) recording in tables. Some types of data will not require the use of all of the steps, but steps 1, 3, 4, and 7 will always be necessary.

² These machines are manufactured by the Electric Accounting Machine Division of the International Business Machines Corporation and the Tabulating Machine Division of Remington Rand Inc. The authors have found local representatives of both companies willing to demonstrate the use of the equipment either to individuals or to groups of students. Seeing the machines at work is vastly superior as a teaching device to mere description of their operation even though aided by pictures.

The code: The preparation of the code to be used in transferring information from long-hand forms to a set of holes punched in cards such as those shown in Figure 16 requires considerable skill. Each card usually represents one schedule or other individual record. Each question on the schedule is represented by one or more vertical columns on the card, while the numbers 0-9 in the column signify the various possible answers to that question.

FIGURE 16
PUNCHED CARDS FOR MECHANICAL TABULATION
A. 80-column card (punched as coded in Figure 19). Reproduced through the courtesy of the Electric Accounting Division of the International Business Machines Corporation.
B. 90-column card.* Reproduced through the courtesy of the Tabulating Machine Division of Remington Rand Inc.
(Card images not legible in this copy.)
* The 90-column card is made possible by dividing a 45-column card horizontally, allowing six spaces in each half column. The machines can be set to use either the upper or lower half of the card. The six spaces are used for 11 numbers by means of a combination punch, e.g., in column 1 the space above the line is punched for zero; if the 1 space alone is punched as in column 2 it represents 1; 1 and 9 as in column 3 represents 2; etc.

The actual work of preparing a code can be visualized from the example shown in Figures 17 and 18. Figure 17 is a reproduction of page 1 of a schedule used in collecting housing information in Buffalo, New York, in 1930. Figure 18 is a reproduction of the instructions for coding the answers to questions A-1, A-2 and A-3 (a), (b), (c), and (d).

Columns 1-3 make a direct transfer of the serial numbers of the schedules to the card. Column 4 shows the simplest type of transfer of non-numerical information to the card. The note attached to the code for column 4 indicates that the information in question A-2 was not coded. This was omitted because only a few of the houses in the study showed any variation in construction material.
Column 5 involves the same transfer of non-numerical information as column 4, although not so easy to record because each number in the column stands for a combined occurrence of cellar and attic. Column 6 is a simple transfer of numerical information. Column 7 is a combination transfer like column 5. Columns 8 and 9 use a somewhat complicated plan for recording the other rooms in a house. For example, four in column 8 and seven in column 9 means that the house contained a dining-room, an entrance hall, an inclosed porch, an open porch, and one other room. In the same way other numbers in the two columns indicate various combinations of rooms in a house. Column 10 gives a summary of the detailed information recorded in columns 7, 8, and 9. The note appended to the code for column 10 is necessary as a guide for the coder. In columns 7 and 10 the symbols X and B appear. The machines will sort on 12 items in a column since two extra punches can be made in the upper margin; therefore there is provision for two extra items in a column if necessary. This was done in column 7 to allow for two additional combinations of bedrooms and baths. Likewise in column 10 the largest house in the study contained 11 rooms and a few schedules were indefinite on the total number of rooms. It is not sufficient to prepare a code that will provide for all the information on the schedule, but the code should be so arranged that the desired tabulations can be taken from the machines with a minimum of sorting and cross-tabulating. In short, the person preparing the code must be familiar with all of the machine operations as well as with the method of subsequent analysis. For example, a code may be required that will designate each of the 48 states and the District of Columbia. It is obvious that two columns are needed in order to include so many items. 
Experience shows that to use such a code as Alabama-00, Arizona-01, Arkansas-02, etc., is not efficient. It is better to use one column for geographical subdivisions and the second for the states in each as follows:

Maine ............ 00        Connecticut ...... 05
New Hampshire .... 01        New York ......... 10
Vermont .......... 02        New Jersey ....... 11
Massachusetts .... 03        Pennsylvania ..... 12
Rhode Island ..... 04        etc.

Then, if only the major subdivisions are needed in a table, it will be necessary to sort on only the tens column instead of both.

FIGURE 17
THE PRESIDENT'S CONFERENCE ON HOME BUILDING AND HOME OWNERSHIP
Form One (Partial Reproduction)

Case No. ........
Information from Home Purchasing Families

A. Description of house
   1. Type of construction
      Single house ....  Income house ....  Two-family house ....  Other (specify) ....
   2. Material
      Superstructure: Frame; Brick veneer; Brick; Tile stuccoed; Other (specify)
      Cellar: Stone; Tile; Concrete block; Other (specify)
      Roof: Shingle, (a) Plain, (b) Treated; Fibre shingle; Slate; Other (specify)
   3. Rooms and space included
      (a) Is there a cellar ....  Attic: Finished ....  Unfinished ....
      (b) How many floors including cellar and attic ....
      (c) Is there a separate dining room ....  dining nook or alcove ....  kitchen ....  pantry ....  laundry room ....  entrance hall ....
      (d) Number of bedrooms ....  bath rooms ....  lavatories ....  glass enclosed porches ....  screened porches ....  open porches ....  other rooms (including living room, den, library, play room, sewing room, etc.) ....
      (e) Dimensions of house: Front ....  Depth ....
      (f) Size of lot occupied by house: Front ....  Depth ....
      (g) Value of lot without house: At time of purchase ....  Present assessment ....
      (h) Corner lot ....
   4. Garage
      (a) Size: One-car ....  Two-car ....  Three-car ....
      (b) Construction: Frame ....  Concrete block ....  Brick ....  Other (specify) ....

FIGURE 18
INSTRUCTIONS FOR CODING DATA ON HOME BUILDING AND HOME OWNERSHIP IN BUFFALO, NEW YORK, 1930
(Reproduction of columns 1-10)

Column 1-3. Serial number of schedule
   001 = 1        526 = 526        999 = 999

Column 4. Type of house
   0  Single house            2  Income house
   1  Two-family house        3  Three-family house
   (NOTE: Although material about the type of construction of the house will not be coded, the coder is instructed to keep a list of the schedule number of every house that is not of frame construction and to note the exact type of construction of these exceptions.)

Column 5. Cellar and attic
   0  No cellar and unfinished attic      3  Cellar and finished attic
   1  No cellar and finished attic        4  Cellar and no attic
   2  Cellar and unfinished attic         5  No cellar and no attic

Column 6. Number of floors in house
   1  One floor        2  Two floors        3  Three floors        etc.

Column 7. Number of bedrooms and bathrooms
   (The number of baths in the right-hand group is not legible in this copy.)
   1  1 bedroom and 1 bath         7  1 bedroom and .. baths
   2  2 bedrooms and 1 bath        8  2 bedrooms and .. baths
   3  3 bedrooms and 1 bath        9  3 bedrooms and .. baths
   4  4 bedrooms and 1 bath        0  4 bedrooms and .. baths
   5  5 bedrooms and 1 bath        X  5 bedrooms and .. baths
   6  6 bedrooms and 1 bath        B  6 bedrooms and .. baths

Column 8-9. Other rooms in house
   See Appendix 1 for code number to be used.

Column 10. Total number of rooms in house
   1  1 room                 0  10 rooms
   2  2 rooms                X  11 rooms
   3  3 rooms, etc.          B  Unknown

   Rooms to be counted in computing total for Column 10
   Include: Dining-room; Kitchen; Entrance hall; Bedroom; Inclosed porches; Other rooms (Living-room, Den, Library, Playroom, Sewing room)
   Exclude: Dining nook; Pantry; Laundry room; Bathroom; Lavatories; Open and screen porch

Other precautions can be taken to reduce to a minimum errors in transcribing data to the code sheet as well as in punching the cards. Whenever possible the number used in the code should correspond to the number written on the schedule. In Figures 18 and 19 it will be observed that this has been done for column 6, but not for column 4. In the latter case it was considered preferable to take off the answers in the same order in which they appeared on the schedule, but "income house" could have been placed first in preparing the schedule and coded as 0, one-family as 1, two-family as 2, etc.
For the same reasons that determine the inclusion of extra columns in a work sheet, it is desirable to make allowance in the code for "no answer" or special cases which may require listing by hand. As far as possible the same number should be used in each column for such answers, the 11th and 12th positions being used for this purpose. They are called "X" and "B," or "X" and "Y."

The code sheet: The complete code sheet reproduced as Figure 19 was used as an intermediate step in the transfer of the information on housing from the collection schedule (partially reproduced in Figure 17) to the punched cards. The sheet is divided into fields and each column is labeled to conform to the descriptions in the code (partially reproduced in Figure 18). The entries in each row of the body of the sheet are the code numbers which stand for the information contained in the schedule whose serial number appears at the left. Schedule No. 669 is recorded on Figure 19 to illustrate the procedure. This information is also reproduced on the 80-column card of Figure 16-A.

The construction of the code sheet greatly facilitates the punching of the cards, but is not an absolutely necessary step in the process. The transfer of the information from written to numerical form can be done on the margin of the collection schedule. More time will be spent by the punch operator in taking the information from the margin of the schedule than from the code sheet. On the other hand the transfer from schedule to code sheet will require more time than coding on the margin of the schedule. Which method to use must be determined in each case in the light of the particular circumstances involved. The three primary factors to consider are time, expense, and accuracy.
For continuous recording of operations by business concerns the better plan usually is to arrange the original record so that the coding can be done on it directly without the use of a code sheet. The punch operator becomes so familiar with the code after a few months that the information can be punched directly from the original records without further reference to the code instructions.

Punching the cards: The punching machine is just a simple device with 12 numbered keys fixed over a movable carriage containing the card. As a hole is punched in a column according to the coded information the card advances automatically to the next column in position for punching. These punches may be operated by hand or by electricity.

Sorting-counting: This operation is the fastest part of the process. The machine sorts and counts several hundred cards a minute. In one type of machine the card itself acts as an insulator breaking an electric circuit. When the brush carrying the current comes to a hole in the card, the electric circuit is completed. The completed circuit in turn opens the guiding device to a compartment whose number corresponds to that of the hole in the card, and the card drops into the compartment as it passes along on a conveyor. The other type of machine performs the same process except that the cards are picked out by pins that drop into the holes, instead of by a completed electric circuit. Both machines also have attachments that will count the number of cards falling into each compartment.

Since the machine has only 12 pockets, it is obviously not possible to sort according to more than one column at a time. When it has been necessary to utilize two or more columns in a single code (as in Figure 19, columns 23 and 24) the cards must be run through separately for each column, if a complete sort is wanted.
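The limitation to one column per run can be seen in a small model of the sorter with its counting attachment. This is a sketch under simplifying assumptions: a card is represented as a string with one character per column, the twelve positions are written 0-9, X, and B, and the sample cards are hypothetical.

```python
# The twelve punching positions in a column, one pocket for each.
POCKETS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "X", "B"]

def sort_on_column(cards, column):
    """One run through the sorter: drop each card into the pocket
    matching the hole punched in the given column (numbered from 1)."""
    pockets = {position: [] for position in POCKETS}
    for card in cards:
        pockets[card[column - 1]].append(card)
    return pockets

def count_pockets(pockets):
    """The counting attachment: number of cards in each non-empty pocket."""
    return {position: len(pack) for position, pack in pockets.items() if pack}

def sort_then_resort(cards, first_column, second_column):
    """A complete sort on two columns: sort on the first column, then run
    each resulting pack through again on the second column and count."""
    result = {}
    for position, pack in sort_on_column(cards, first_column).items():
        if pack:
            result[position] = count_pockets(sort_on_column(pack, second_column))
    return result

# Hypothetical two-column cards.
cards = ["52", "53", "52", "61", "5X"]
print(count_pockets(sort_on_column(cards, 1)))
print(sort_then_resort(cards, 1, 2))
```

Each call to `sort_on_column` stands for one pass of the pack through the machine, so a code spread over two columns costs one pass on the first column plus one pass for every pack that must be subdivided.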
In columns 23 and 24 the cost of the house is recorded in hundreds of dollars, the last two ciphers having been dropped. Thus if a count is wanted by $100 groups, the cards are first sorted on column 23, in $1,000's, and each $1,000 is then re-sorted and counted on column 24 in $100 groups: $2,000, $2,100, $2,200, etc.

FIGURE 19
CODE SHEET
Used in Transferring Information From Questionnaire, Figure 17
(Code filled in for schedule No. 669)
(Body of the code sheet not legible in this copy. The legible column headings include serial number; house description: type, cellar, number of floors, number of bedrooms, other rooms, number of rooms; house front and depth; lot front and depth; garage; year built; original selling price; original house cost; total cost to present owner; down payment; original value of first and second mortgages; amortization and mortgage payments; age, occupation, weeks employed, and total earnings of breadwinners; dependents; and items of family expenditure.)

Tabulation and cross-tabulation: The "tabulator" consists of several banks of small adding machines which may be arranged electrically to record, total, and print almost any desired combination of data from the cards. The same thing can be accomplished by the use of the sorter alone, but only after multiple sorting, rearranging, and computing. The difference can be demonstrated by further reference to Figure 19.
It was desired to prepare a table showing the average cash payment according to the total price to the present owner in $1,000 intervals. Without the tabulator, it would be necessary to make a sort on column 23 and then to re-sort and count each pack separately according to columns 25 and 26. This would give results in the form of 12 frequency distributions each having possibly 100 class intervals from which the average cash payments could be derived. With the tabulator it was necessary to sort only on column 23.³ The tabulator was then set to count the items in each $1,000 price group, to add the exact amounts recorded in columns 25 and 26 and to print the subtotals and grand totals as shown in Figure 20, columns 1, 2, and 3. The amounts of the first and second mortgages, columns 4 and 5, were also totaled during this same operation. Reading across the first row, the total cash payments made by 11 purchasers of houses costing between $2,000 and $3,000 amounted to $4,700; the total value of the first mortgages of these 11 houses was $17,900 and the total value of the second mortgages was $4,800.

Recording results in tables: When the sorter alone is used a large number of blank forms must be prepared in advance for the recording of any possible cross-tabulation that may be needed, but with the tabulator this task becomes much lighter.

³ When used in conjunction with the tabulator, the principal function of the sorter is to arrange the cards in consecutive order according to some one classification which will become the stub of the resulting table. If this classification has required the use of two columns on the card, the right hand or units column should be sorted first. These ten packs are then piled together in order, with zeros at the bottom, and are run through the sorter on the left hand or tens column.
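The two sorter passes just described (units column first, then tens column) followed by a single tabulator run can be sketched as follows. The cards are hypothetical, the field names are assumptions, and the physical detail of stacking packs with zeros at the bottom is represented simply by keeping the pockets in ascending order.

```python
def sorter_pass(cards, digit):
    """One pass through the sorter on a single column. Cards keep their
    relative order inside each pocket, and the pockets are then stacked
    in order from 0 to 9."""
    pockets = [[] for _ in range(10)]
    for card in cards:
        pockets[digit(card)].append(card)
    return [card for pocket in pockets for card in pocket]

def sort_two_columns(cards):
    """Units column first, then tens column, as in the note above."""
    by_units = sorter_pass(cards, lambda c: c["cost_hundreds"] % 10)
    return sorter_pass(by_units, lambda c: c["cost_hundreds"] // 10)

def tabulate(cards):
    """One tabulator run: count the cards and total the cash payment
    for each $1,000 cost group (the tens digit of the cost code)."""
    groups = {}
    for card in cards:
        thousands = card["cost_hundreds"] // 10
        count, total = groups.get(thousands, (0, 0))
        groups[thousands] = (count + 1, total + card["cash_hundreds"])
    return groups

# Hypothetical cards: cost and cash payment, both in hundreds of dollars.
cards = [
    {"cost_hundreds": 31, "cash_hundreds": 9},
    {"cost_hundreds": 25, "cash_hundreds": 4},
    {"cost_hundreds": 29, "cash_hundreds": 5},
    {"cost_hundreds": 22, "cash_hundreds": 3},
    {"cost_hundreds": 38, "cash_hundreds": 8},
]
ordered = sort_two_columns(cards)
totals = tabulate(ordered)

# Average cash payment in dollars for each $1,000 group.
averages = {k: total * 100 / count for k, (count, total) in totals.items()}
print([c["cost_hundreds"] for c in ordered], totals, averages)
```

The second pass leaves the cards in full consecutive order only because the first pass's order is preserved inside each pocket. The averages are obtained from the printed totals by the same division the text applies to the first row of its example, where $4,700 spread over 11 purchasers gives about $430.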
Since the cards at the bottom of the pile pass through the sorter first, the unit zeros are sorted first and fall at the bottom of each tens compartment, followed by the unit ones, twos, etc. When the tens packs are then piled together, with zeros at the bottom, the entire set of cards is in consecutive order, and is ready to tabulate. The tabulator can be set to separate them according to each unit, if desired. In the case illustrated, only the tens values were needed, so it was necessary to sort and tabulate only on column 23, disregarding column 24 entirely.

FIGURE 20
REPRODUCTION OF THE PRINTED RECORD FROM THE TABULATING MACHINE WITH HEADINGS ADDED, DATA FROM CODE SHEET, FIGURE 19

(1) Cost of house    (2) Number    (3) Total of     (4) Total value of    (5) Total value of
to present           of cases      cash payment     first mortgages       second mortgages
purchaser ($1,000)                 ($100)           ($100)                ($100)

        2                11              47               179                   48
        3                27             238               411                  302
        4                32             357               772                  296
        5               158            1186              4631                 2748
        6               134            1816              4574                 2091
        7                95            1998              3559                 1497
        8                62            1173              2685                 1294
        9                41             932              1999                  842
                         23             629              1178                  532
        1                14             417               798                  338
        †                22             968              1292                  526
                        619*           9761*            22078*               10514*

† The information in this row was recorded in the B position of the cards. The tabulator did not print the B.
* Asterisks denoting totals are part of the machine record.

The first step is the preparation of the plan for the derivative table which will result from the tabulation. Then an outline form of the primary table is made. This is used by the machine operator in arranging the machine. When the cards are tabulated the results are printed by the machine in a form similar to Figure 20. This is the primary table. From the primary table the derivative table can be constructed.

TABLE 14
DERIVED TABLE FROM TABULATING MACHINE RECORD, FIGURE 20: ARITHMETIC AVERAGE OF CASH PAYMENT AND ORIGINAL FACE VALUE OF FIRST AND SECOND MORTGAGES FOR DIFFERENT PROPERTY COSTS. RESTRICTED TO PROPERTIES PURCHASED IN 1922 OR AFTER. 619 BUFFALO FAMILIES

Cost of property       No. of     Average cash     Average face value     Average face value
to present owner       cases      payment          first mortgage         second mortgage

$ 2,000- 2,999           11        $  430             $1,630                 $  440
  3,000- 3,999           27           880              1,520                  1,120
  4,000- 4,999           32         1,120              2,410                    920
  5,000- 5,999          158           750              2,930                  1,740
  6,000- 6,999          134         1,360              3,410                  1,560
  7,000- 7,999           95         2,100              3,750                  1,580
  8,000- 8,999           62         1,890              4,330                  2,090
  9,000- 9,999           41         2,270              4,880                  2,050
 10,000-10,999           23         2,730              5,120                  2,310
 11,000-11,999           14         2,980              5,700                  2,410
 12,000-15,999           22         4,400              5,870                  2,390
Total or average        619         1,580              3,570                  1,700

In this case Table 14 was constructed by dividing the totals in each row of columns 3, 4, and 5 of Figure 20 by the number of houses in the row, column 2.

Summary of Mechanical Tabulation. It is hardly to be expected that the whole process of mechanical tabulation will be clear merely from reading this description. Perhaps it will serve as a guide as to what to look for in a demonstration of the machines. Mechanical tabulation is a great assistance in accounting and statistical work both internal and external. But it cannot be used to advantage in all cases. Certain types of work call for the mechanical process; others are not adapted to its use. The outstanding criterion is the size of the investigation. If either a large number of cases or a large amount of cross-information is involved, the mechanical process should be used.

PROBLEMS

1. What routine does an editor follow in searching for irregularities in schedules?
2. What is meant by re-editing? When should the process be employed?
3. What would you include in the qualifications of an editor?
4. The following is a card from the investigation illustrated in Figure 13. This card was returned to the collecting agent by the editor. What do you think the editor found wrong and what did he want the agent to do?

   COLLECTION CARD USED IN RESIDENTIAL VACANCY INVESTIGATION IN BUFFALO, N. Y.
   Serial No. ........    Address: 526 So. Elm St., front and rear houses
   Ward ........    Tract ........    Enumeration District ........
   No. of Dwelling Places in Building: One ....  Two ....  Three ....  Four: 4  Over Four (give number) ....
   Occupied: 2    Vacant: 1
   Residential: X    Combination: X
   Agent: A. S. Agnew

5. Describe in detail an investigation in which you would use sorting-counting at the preliminary tabulation stage.
6. Describe in detail an investigation in which you would use a tally sheet at the preliminary tabulation stage.
7. Describe in detail an investigation in which you would use a work sheet at the preliminary tabulation stage.
8. What is the principle of mechanical tabulation?
9. What are the advantages of mechanical tabulation?
10. Describe an investigation in which mechanical tabulation should not be used.
11. The following is an approximate reproduction of the invoice form used by a firm having 63 salesmen, 4,000 customers, and listing in its catalogue 13,200 commodities classified in 12 departments.

   JONES SMITH INC.
   HEAVY HARDWARE
   "Serving the United States"
   Date ........    Invoice No. ........    Dept. ........
   Salesman ........    Territory ........
   SOLD TO ........
   Quantity | Commodity | Unit Price | Amount

   Prepare a code for transferring all of the information on this invoice to 45-column mechanical tabulation cards. The code should be prepared so that information could be taken from the cards concerning any of the following either separately or in combinations: sales from day to day, sales by departments, the record of any one individual invoice, the amount of goods sold over a period to any individual customer, the distribution of sales by states, cities, or territories, the sales made during a month by each salesman, the quantity of each commodity sold during a period, the amount of sales during a period.

REFERENCES

BAILEY, WILLIAM B., and CUMMINGS, JOHN, Statistics. Chicago: A. C. McClurg & Co., 1917.
   Chapter IV contains a complete and lucid statement of the purpose of editing schedules and the duties of an editor.

BROWN, LYNDON O., Market Research and Analysis. New York: The Ronald Press Co., 1937. An excellent statement on editing and methods of testing a sample appears in chapter 12 and rules for preliminary tabulation are given in chapter 13.

EIGELBERNER, J., The Investigation of Business Problems. Chicago and New York: A. W. Shaw Co., 1926. Chapter XIII explains the function of the editor as official critic of the quality of collected information.

SCHLUTER, WILLIAM C., How To Do Research Work. New York: Prentice-Hall, Inc., 1929. A concise statement of the editing process appears in chapter XI.

Standards of Research. Des Moines, Iowa: Meredith Publishing Co., 1929. An excellent statement of the functions and qualifications of an editor appears on pages 29-32. Instructions for mechanical tabulation are given on pages 32-36 and a standard code on pages 69-75.

CHAPTER VIII
TABULATION

DEFINITIONS

The tabulation of statistical data is the orderly arrangement of concrete numerical information in vertical columns and horizontal rows. This definition excludes lists of facts which are not numerical and mathematical tables which deal with abstract numbers. It contains three separate concepts: (1) numerical information regarding actual items, events, values, or relationships; (2) a definite order of arrangement for this information; (3) the preparation of forms with rows and columns in which the numerical data may be recorded according to their orders of arrangement.

The method of collecting concrete numerical information has been discussed in chapters IV and VI. It has already been indicated in these chapters that the final form in which the collected data are to be arranged must be anticipated to some extent in planning the collection, and also in the stages of preliminary tabulation as explained in chapter VII. However, the thorough analysis of the various orders of arrangement has been reserved for this chapter.
The latter part of the chapter is concerned with the third factor in tabulation, the principles and practices in preparing tabular forms for the recording of classified statistical data.

Two kinds of statistical information may be presented in tabular form: (1) several sets of more or less heterogeneous information, and (2) data representing a definite universe expressed in a common unit. In the first kind of table the several sets of information are not expressed in the same unit, but they are arranged according to a single common characteristic, such as the dates of successive observations. Grouping such data in tabular form is a space-saving device very often used and is a legitimate statistical technique provided that the different sets of information bear some relation to each other. The other kind of table contains homogeneous data employing a common unit and arranged according to one or more definite orders of classification. The ensuing discussion of the elements and orders of classification deals with the second kind of table.

Classification

Classification is the arrangement of a set of observations or computed figures according to a previously determined plan. This arrangement may involve the separation of a whole into parts or the listing of related sets of information.¹

Elements of Classification. Each plan of classification involves several indispensable elements:

1. The data have been collected and arranged for some definite statistical purpose.
2. The enumeration or collection has involved one unit, that is, the items counted were all defined in the same way.
3. These similarly defined units each possess one or more of the same variable characteristics so that all of them can be classified according to each of these characteristics.
4. For each classification that involves the separation of a whole (that is, a total of identically defined units) into its parts, the classes must be mutually exclusive.
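
The fourth element lends itself to a mechanical check. The following is a minimal Python sketch, assuming hypothetical class limits (patterned on cost-of-property classes) and five hypothetical observations; the names `CLASSES`, `classes_for`, and `classify` are illustrative only. Each unit is counted only if it falls into exactly one class, so the class counts must add to the whole.

```python
from collections import Counter

# Hypothetical class limits in dollars (assumed for illustration only).
CLASSES = {
    "$2,000-2,999": (2000, 2999),
    "$3,000-3,999": (3000, 3999),
    "$4,000-4,999": (4000, 4999),
}


def classes_for(cost, classes):
    """Return the names of every class whose limits contain `cost`."""
    return [name for name, (low, high) in classes.items() if low <= cost <= high]


def classify(costs, classes):
    """Count units per class, insisting that each unit fits exactly one class."""
    counts = Counter({name: 0 for name in classes})
    for cost in costs:
        matches = classes_for(cost, classes)
        if len(matches) != 1:
            raise ValueError(f"{cost} falls in {len(matches)} classes: {matches}")
        counts[matches[0]] += 1
    return counts


# Hypothetical observations, all in one unit: cost of one house in dollars.
costs = [2500, 3100, 3999, 4000, 4750]
counts = classify(costs, CLASSES)
for name, n in counts.items():
    print(f"{name:>14} {n:>3}")
print(f"{'Total':>14} {sum(counts.values()):>3}")
```

If two classes shared a limit (say one ending at 3,000 and the next beginning at 3,000), a house costing exactly $3,000 would match both, and the sketch would stop with an error instead of counting the unit twice.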
The foregoing statements all refer to data in the original form in which they have been collected, that is, to primary data. The distinction between primary and derived tables will be mentioned later, but for the present the elements of classification and the various orders of classification are discussed in the simplest terms, as applied to primary data.

¹ This definition differs materially from that given in Edmund E. Day, Statistical Analysis (New York: The Macmillan Company, 1925), pp. 36 and 42. Day distinguishes classification (the separation of a larger group, "population" or "universe," into smaller groups or classes on the basis of a specified criterion or characteristic feature) from seriation (the arrangement of an orderly succession of items relating differences in one variable to differences in another). This distinction, although theoretically a fundamental one, is disregarded in the discussion of the principles and methods of tabulation in this chapter, because classification and seriation both follow the same rules with regard to arrangement. The distinction made by Day is referred to at the end of chapter XIII, in introducing graphic methods that involve two variables.

Purpose: The purpose is so closely connected with the other elements of classification that it requires little elaboration. For example, we wish to know how many students in a freshman class are men and how many are women. "Freshman students," therefore, is selected automatically as the unit to be counted, and the variable characteristic according to which the units will be classified is "sex." Since this variable can have only two classes, male and female, a very simple classification will result:

| Class | Number |
|---|---|
| Male | 125 |
| Female | 98 |
| Total freshmen | 223 |

This elementary illustration shows why the final objective must be kept clearly in mind at the initial stages of the investigation.
If no provision were made for recording male or female on the student's registration card, there would be no reliable way of counting the number of men and women, since such names as "LaVerne" or "Marion" may belong to either sex.

Units: In any table of primary data, the unit is whatever is being counted. One should be able to select any single figure from a table, for example, the first figure from Table 1, chapter I, and ask, "804,350 what?" The answer which can be read from the title of the table is always in terms of the unit forming the basis of that particular tabulation, and it will be equally applicable to every figure in the table. In this case the unit is passenger automobiles sold. Note that, although one might also say 804,350 Chevrolets were sold in 1937, these further limitations do not apply to all of the figures in the table. They are therefore not part of the definition of the unit of the table. Chevrolet is one class of one of the variable characteristics, "make of automobile," and 1937 is one class of another variable characteristic, "model year."

Variable characteristics: Collected data may possess one or more variable characteristics according to the degree of detail necessitated by the purpose of the investigation. In the foregoing illustration, the variable characteristic "make of automobile" was subdivided into the classes, Chevrolet, Ford, and Plymouth, and the variable characteristic "model year" also had three classes, 1937, 1938, and 1939. In the previous illustration the freshmen who were classified by sex might have been differentiated according to a second characteristic, "university division." If a similar count were made in several universities, for several successive years, the complete data would contain four variable characteristics: university, division, sex, and year of entrance. Each student counted could be distinguished according to all four of the characteristics; for instance, student No. 1 might be a man, at University C, in Business Administration, entering in 1940.

Mutually exclusive classes: In the total count, therefore, each student would fall into one, and only one, of the possible categories under each of the four variable characteristics, that is, in one class of each classification. The class "entered in 1940" could not include any of those counted as "entered in 1941"; none of the freshmen entering University A could also be counted as entering University B, etc. Likewise in Table 1, chapter I, each one of the 4,613,424 passenger cars sold during the 3 years is included in only one of the individual figures (known as "cells") of the table, except that the figures of the last row are subtotals, each one including all items in the column above it. It will be explained later that classification according to two or more characteristics may be crossed in a single table, as in the table of automobile sales. However, there will be no overlapping of the several classes in any one classification unless two or more variable characteristics become confused in listing the classes. If this is allowed to occur, such classifications as the following may result:

HOMICIDES IN THE CITY OF NEW YORK, 1926

| Class | Number | Class | Number |
|---|---|---|---|
| Manhattan | 218 | Shooting | 213 |
| Brooklyn | 82 | Assault | 53 |
| Bronx | 19 | Stabbing | 54 |
| Queens | 22 | Gas | 7 |
| Richmond | 3 | Infanticide | 7 |
| Whole city | 344 | Poison | [illegible] |
| By Negroes | 41 | Accidental | 10 |
| By husbands | 4 | By police | 19 |
| By wives | 3 | Suicide | 10 |

In this table one must assume that the figure in the sixth row (whole city) is the total, since it is the sum of the figures for the five boroughs. These six rows alone comprise a correct tabulation of the whole number of homicides divided into five mutually exclusive classes according to the single characteristic, place of occurrence. The remainder of the table lists homicides according to at least three additional characteristics: race of the person committing the crime, relationship to the victim, and cause of death.
If all of the classes of these several classifications were given so that each classification included all of the 344 cases, the table would not be incorrect, although it would afford no cross-classification. As it stands, all the classes appear to be parts of a single classification, but there is overlapping between them. For instance, some of the 41 homicides by Negroes probably were committed in Manhattan, some of the Negroes may have been husbands of the victims, and some of the stabbing cases may have also been committed by Negroes. A single homicide could have answered to every characteristic that has been suggested; hence the classes are not mutually exclusive.

Orders of Classification: Time, Space and Attribute. There are three main types of characteristics with respect to which data may be classified: (1) variation in time, (2) variation in space, and (3) variation of an attribute. In a previous example freshman students were classified according to (1) time (year of entrance), (2) space (location of the university), and (3) the two attributes, sex, and university division in which each one was registered. Thus all three orders of classification are represented in this example. Another case of the three orders in a single set of data can be found in the tabulation of yearly car loadings in the United States, by "months," by "regions," and by "size of railroad." On the other hand many tables will contain only one or two orders of classification, or there may be two or three variable attributes and no variation in either time or space.

In all of the examples that have been mentioned, the number of units in each class was obtained by counting objects, that is, freshmen, automobiles sold, loaded cars, and even the cases of homicide, according to their several variable characteristics. In many tabulations, however, counted objects or persons are replaced by prices or rates.
The data may be expressed in units of dollars and cents but each figure is actually a ratio meaning "number of dollars paid per bushel" or "per ton," etc. Such data may be classified in the same way as countable units, according to mutually exclusive variable characteristics, which may relate to changes in either the numerator or the denominator of the ratio. Table 15 contains three such classifications of wheat prices, each of which illustrates one of the orders of classification.

Table 15-A shows changes in the unit "Average Weekly Cash Price of No. 2 Hard Winter Wheat at Chicago" classified according to time of occurrence. The four weeks quoted constitute four classes in this time classification. Other characteristics are held constant by the definition of the unit, i.e., all prices are taken from the same market, for cash sales, for the same grade of wheat. There is no overlap in the classes of the classification.

Table 15-B shows changes when the unit "Average Cash Price of No. 2 Hard Winter Wheat for the Week of October 3-8, 1938" is classified according to the variable characteristic place of occurrence. The three cities named as markets are three mutually exclusive classes in the spatial classification. Other characteristics are held constant by the definition of the unit, i.e., all prices are taken for the same time period, for cash sales, for the same grade of wheat.

Table 15-C shows changes in the unit "Average Cash Price of Spring Wheat at Minneapolis for the Week of October 3-8, 1938." The five grades of wheat in the attribute classification are clearly defined and non-overlapping in accordance with the definitions established by the United States Grain Standards Act. Other characteristics of the unit are held constant by the definition, i.e., all prices are taken for the same time period, for cash sales, in the same market.

TABLE 15
THREE TYPES OF CLASSIFICATION*

A. TIME
AVERAGE WEEKLY CASH PRICES OF No. 2 HARD WINTER WHEAT AT CHICAGO FOR FOUR WEEKS OF OCTOBER, 1938

| Week | Average Price per Bu. |
|---|---|
| Oct. 3-8 | $ .669 |
| Oct. 10-15 | .672 |
| Oct. 17-22 | .674 |
| Oct. 24-29 | .680 |

B. SPACE
AVERAGE CASH PRICES OF No. 2 HARD WINTER WHEAT IN THREE MARKETS FOR THE WEEK OF OCTOBER 3-8, 1938

| Market | Average Price per Bu. |
|---|---|
| Chicago | $ .669 |
| Kansas City | .638 |
| St. Louis | .678 |

C. ATTRIBUTE
AVERAGE CASH PRICES OF FIVE GRADES OF SPRING WHEAT AT MINNEAPOLIS FOR THE WEEK OF OCTOBER 3-8, 1938

| Grade | Average Price per Bu. |
|---|---|
| Dark Northern Spring Heavy, No. 1 | $ .738 |
| Dark Northern Spring, No. 1 | .733 |
| Dark Northern Spring, No. 2 | .701 |
| Northern Spring, No. 1 | .640 |
| Hard Amber Durum, No. 2 | .651 |

* Crops and Markets, United States Department of Agriculture, Vol. XV, No. 11 (November, 1938), p. 254.

The three types of classification can be easily distinguished by noting that in Table 15-A all observations refer to a single place and invariant attributes, time being variable; in Table 15-B all observations refer to a single time interval and invariant attributes, location or space being variable; and in Table 15-C all observations refer to a single time period, a single place, and attributes that are invariant except the attribute of grade of wheat. Greater care is necessary in dealing with variable attributes than with variable time or place because a universe can have many different attributes. For example, classifications of prices of spring wheat could also be made according to percentage of dockage, kind of contract of sale, or type of purchaser.

Attribute classifications are sometimes divided into qualitative and quantitative. Table 15-C is an example of a division according to the qualitative attribute, grade of wheat. Attribute classifications are quantitative when the attribute is expressed numerically, as in size or price groups. Such classifications take the form of frequency distributions, the discussion of which is deferred to chapter XV.
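
The idea of letting one characteristic vary while the definition of the unit holds the others constant can be sketched in Python. This is only an illustrative sketch: the record layout and the function name `one_way` are assumptions, and the prices are the hard winter wheat figures of Table 15-A and 15-B as read from the printed table.

```python
# Each record: (week, market, grade, average cash price per bushel in dollars).
GRADE = "No. 2 Hard Winter"
records = [
    ("Oct. 3-8", "Chicago", GRADE, 0.669),
    ("Oct. 10-15", "Chicago", GRADE, 0.672),
    ("Oct. 17-22", "Chicago", GRADE, 0.674),
    ("Oct. 24-29", "Chicago", GRADE, 0.680),
    ("Oct. 3-8", "Kansas City", GRADE, 0.638),
    ("Oct. 3-8", "St. Louis", GRADE, 0.678),
]
FIELDS = {"week": 0, "market": 1, "grade": 2}  # position of each characteristic


def one_way(records, vary, **held_constant):
    """Classify prices by `vary`, keeping only records that match `held_constant`."""
    table = {}
    for rec in records:
        if all(rec[FIELDS[name]] == value for name, value in held_constant.items()):
            label = rec[FIELDS[vary]]
            if label in table:
                # Two figures for one class mean a characteristic was not held constant.
                raise ValueError(f"class {label!r} appears twice")
            table[label] = rec[3]
    return table


# Time classification: place and grade held constant (Table 15-A).
by_time = one_way(records, "week", market="Chicago", grade=GRADE)
# Space classification: time and grade held constant (Table 15-B).
by_space = one_way(records, "market", week="Oct. 3-8", grade=GRADE)

for title, table in (("TIME", by_time), ("SPACE", by_space)):
    print(title)
    for label, price in table.items():
        print(f"  {label:<12} ${price:.3f}")
```

The Chicago price for the week of October 3-8 appears in both results, since that one observation belongs to one class of the time classification and one class of the space classification.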

TYPES OF TABLES

Statistical tables can be divided into two categories: primary and derivative.

Primary Tables

A primary table is a full presentation of the collected data in the original units. An investigation of any complexity may require several primary tables. Such complete tables serve as a basis from which the statistician selects certain related sets of data that may be presented in various ways, depending on the purpose in view. If the original data are to be published for general use without knowledge of what the uses will be or which relationships will be considered most important, primary tables may be given in full. Due to the expense of publication such tables are not commonly found in print, but certain parts of the original data such as a group of subtotals comprising a grand total may be published in order to bring the important data together in compact form.

Derived Tables

A derived table is one that presents the results of some analysis of the original data, such as percentage distributions, per cents of increase or decrease, values per capita, index numbers, or coefficients. These are constructed from the original data by the application of statistical methods and may be published either alone or accompanied by a part or all of the data upon which they depend.

The chief requirement of a derived table is that it should present one unified set of relationships. An attempt to set forth too many ideas in one derived table usually results in confusion. The preferable method is to use several short clear tables each of which has one definite purpose. Thus one primary table frequently becomes the source for many derived tables.

Parts A and B of Table 16 show two entirely different sets of information, but both were drawn from the same primary table which gave the distribution of explosives workers of each grade of skill according to average hourly earnings.
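
The step from a primary to a derived table can be retraced with the machine record of chapter VII. The following Python sketch takes four rows of Figure 20 (totals in units of $100) and divides each total by the number of cases, as was done for Table 14. The function names and the rounding to the nearest $10 are assumptions made for this sketch, chosen because the published averages all end in zero.

```python
# Rows of Figure 20: (cost class, cases, cash payments, first mortgages,
# second mortgages). The three money totals are in units of $100.
PRIMARY = [
    ("$2,000-2,999", 11, 47, 179, 48),
    ("$3,000-3,999", 27, 238, 411, 302),
    ("$6,000-6,999", 134, 1816, 4574, 2091),
    ("Total or average", 619, 9761, 22078, 10514),
]


def nearest_ten(dollars):
    """Round a dollar amount to the nearest $10 (assumed rounding rule)."""
    return int(round(dollars / 10)) * 10


def derive(rows):
    """Build derived rows: average per case for each of the three money columns."""
    derived = []
    for label, cases, *totals in rows:
        averages = [nearest_ten(total * 100 / cases) for total in totals]
        derived.append((label, cases, *averages))
    return derived


for label, cases, cash, first, second in derive(PRIMARY):
    print(f"{label:<18}{cases:>5}{cash:>8,}{first:>8,}{second:>8,}")
```

Only one kind of relationship, the arithmetic average per case, is computed, which is what keeps the resulting table unified.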

TABLE 16
TWO DERIVED TABLES FROM ONE PRIMARY TABLE*

A. PERCENTAGE DISTRIBUTION OF EXPLOSIVES WORKERS, BY AVERAGE HOURLY EARNINGS AND SKILL, OCTOBER, 1937

| Average Hourly Earnings (in cents) | Skilled | Semi-skilled | Unskilled | Total |
|---|---|---|---|---|
| Under 37.5 | .4 | .7 | 2.2 | .8 |
| 37.5 and under 42.5 | .3 | 1.4 | 1.9 | .9 |
| 42.5 and under 47.5 | .5 | 1.0 | 3.9 | 1.2 |
| 47.5 and under 52.5 | .5 | 2.5 | 9.3 | 2.7 |
| 52.5 and under 57.5 | 2.3 | 5.4 | 17.0 | 5.8 |
| 57.5 and under 62.5 | 4.1 | 10.3 | 20.6 | 8.8 |
| 62.5 and under 67.5 | 7.5 | 16.3 | 14.1 | 11.2 |
| 67.5 and under 72.5 | 7.5 | 14.2 | 15.6 | 10.8 |
| 72.5 and under 77.5 | 7.8 | 15.6 | 7.1 | 9.9 |
| 77.5 and under 82.5 | 10.7 | 11.5 | 4.4 | 9.8 |
| 82.5 and under 87.5 | 14.8 | 13.5 | 2.0 | 12.1 |
| 87.5 and under 92.5 | 10.7 | 3.6 | 1.0 | 7.0 |
| 92.5 and under 97.5 | 10.2 | 2.3 | .5 | 6.2 |
| 97.5 and under 102.5 | 7.5 | 1.1 | .3 | 4.4 |
| 102.5 and under 107.5 | 5.2 | .2 | .1 | 2.9 |
| 107.5 and under 112.5 | 5.3 | .4 | | 3.0 |
| 112.5 and under 125.0 | 3.5 | | | 1.8 |
| 125.0 and over | 1.2 | | | .7 |
| Total | 100.0 | 100.0 | 100.0 | 100.0 |

B. HOURLY EARNINGS RECEIVED BY EXPLOSIVES WORKERS IN THE UNITED STATES ACCORDING TO GRADES OF SKILL, OCTOBER, 1937

Hourly earnings received, in cents:

| Grades of Skill | Middle Wage Received by Lower Half of Workers | Median | Middle Wage Received by Upper Half of Workers |
|---|---|---|---|
| Skilled | 73.7 | 85.3 | 96.4 |
| Semi-skilled | 63.6 | 71.9 | 80.8 |
| Unskilled | 54.8 | 61.3 | 69.4 |

* Adapted from Monthly Labor Review, United States Department of Labor, Bureau of Labor Statistics, Vol. 47, No. 2 (August, 1938), pp. 383, 384.

Table 17 furnishes an example of a derived table in which some of the original data are presented along with the analysis. It appears rather complex but a study of the per cent columns reveals that only one form of analysis is being presented, namely the per cent of dwellings vacant in each ward for the various types of buildings.

ESTABLISHED PRACTICE IN THE CONSTRUCTION OF TABLES

The difference in purpose and content between primary and derived tables has been explained, and in general the rules for tabulation will apply to either kind.
Derived or summary tables appear in print more frequently and they present the chief problems in table construction from the point of view of utility to the reader. The users of statistical analyses may be divided into two groups, those who will read tables and those who will not. As far as the second type of readers is concerned tabular matter might just as well be kept out of print. For their benefit it is necessary to point out in the text the most important information contained in any table. Those who will read tables would prefer to have the textual description omitted so that they can draw their own conclusions from the data presented. For the sake of this group the tables must be made as effective as possible. Certain principles and practices which contribute to that end have become well established by usage and should generally be followed, although occasionally some deviation from customary procedure may increase the effectiveness of a table. In such cases it is more important to use good judgment than to follow rules slavishly.

Unity

The data contained in a table should pertain to one definite subject, should be confined to that subject, and the table should include whatever information is pertinent to a complete presentation of the subject.

In Primary Tables. In a primary table there can be no question as to unity regardless of the degree of cross-classification if the subject is presented in terms of a single unit as illustrated by Table 12, page 124, in chapter VII. Other primary tables may contain information expressed in several units but with no sacrifice of unity due to the fact that significant ratios can be derived from combinations of the various sets of data. Table 18 is of this kind. The primary data given in the table are in three different units, number of telephones, number of messages, and number of dollars of operating income. However, all of them deal with the one subject, operations of the telephone company. Such ratios as number of messages per telephone, income per telephone, income per message of each type, as well as indexes showing relative changes in each unit or in each ratio, afford numerous possibilities for analysis.

[TABLE 17, showing the per cent of dwellings vacant in each ward for the various types of buildings, is not legible in this copy.]

TABLE 18
OPERATING STATISTICS OF THE BELL TELEPHONE SYSTEM, 1931-36*

| Year (1) | Number of Telephones (2) | Number of Messages: Local, 000 omitted (3) | Number of Messages: Toll and Long Distance, 000 omitted (4) | Operating Income: Local Service (5) | Operating Income: Toll Service (6) | Operating Income: Total |
|---|---|---|---|---|---|---|
| 1931 | 15,389,994 | 22,704,825 | 985,500 | $723,920,495 | $326,268,854 | $1,050,189,349 |
| 1932 | 13,793,229 | 21,525,558 | 823,866 | 670,736,747 | 263,147,955 | 933,884,702 |
| 1933 | 13,162,905 | 20,147,635 | 747,155 | 617,253,153 | 243,905,775 | 861,158,928 |
| 1934 | 13,378,103 | 20,676,520 | 781,830 | 607,676,275 | 258,691,363 | 866,367,638 |
| 1935 | 13,844,663 | 21,465,285 | 830,740 | 640,993,436 | 273,483,256 | 914,476,692 |
| 1936 | 14,453,552 | 22,869,510 | 911,340 | 665,152,512 | 306,238,511 | 971,391,023 |

* Annual Reports of the American Telephone and Telegraph Company.

In Derived Tables. In a derived table simple units are replaced or supplemented by such measures as compound units, averages, and percentage relationships. These measures are usually based upon several separate sets of data but the derived table becomes a unified whole if all of the relationships included contribute to a single purpose.

An example of a compound unit is the "ton-mile," a measure of operating density in railroading.
It represents one ton moved the distance of one mile and is derived from the two simple units "tons of freight" and "miles operated." Similarly in other lines of activity "man-hours," "dollar-years" and "foot-pounds" appear. Each of these is more complex than a simple unit yet each presents a single concept so that its use results in a unified table.

Table 17 is an example of a derived table in which percentages accompany data that have been classified in two directions. These percentages are based upon a third classification, number of dwellings occupied and vacant, the original data of which are not included. All of the information in the table contributes toward the one subject, percentage of vacancy in dwelling places. It has already been noted that a derived table loses its effectiveness if it presents too many kinds of relationships. A number of additional sets of percentage relationships could be worked out from the original data of Table 17, such as percentage distribution of vacancy by wards or by type of building, but if any of these were included there would be a loss of unity and the resulting table would have no definite purpose.

Example of Lack of Unity. Many tables appear in print which do not possess the unity found in Tables 17 and 18. An example of heterogeneous data presented in condensed form is shown in Table 19. The information on types of freight car loadings together with the Federal Reserve index forms a table complete in itself. "Pullman passengers carried" has nothing to do with freight car loadings nor with "financial statistics." "Canal traffic" on only two canals, and measured in two different kinds of tons, is possibly the most important information on the subject of water traffic, but more complete data should be given in a separate table, since there is no possible way of relating this information to that concerning freight car loadings.
Tables that contain several sets of unrelated information or of related data expressed in units that are non-comparable are justifiable only in publications in which space-saving is a more important consideration than unity.

[TABLE 19, combining freight car loadings by type, the Federal Reserve index, Pullman passengers carried, and canal traffic, is not legible in this copy.]

Complexity

The orders of classification employed in an investigation depend upon the nature of the data and the purpose for which they are collected. The extent to which it is necessary to study combinations of the several characteristics of the data will determine the degree of cross-classification required in tabulation.

Simple Classification. The first order of classification, commonly referred to as a "one-way table," has been illustrated in Table 15. In each part of this table the prices of wheat are classified according to a single characteristic and no difficulty arises in their presentation. A derived table that contains a single classification follows the same form of construction.

Cross-Classification. If classification is desired according to two characteristics simultaneously it will not serve the purpose merely to list the two separately. They must be cross-classified in a "two-way" table. This obviously requires listing in two directions, consequently one classification will appear horizontally and the other vertically, as in Figure 21.

FIGURE 21
FORM FOR TWO-WAY CROSS-CLASSIFICATION
NUMBER OF CATTLE, CALVES, HOGS AND SHEEP SLAUGHTERED BY THREE MEAT PACKERS IN 1937

| Packers | Animals Slaughtered: Cattle | Calves | Hogs | Sheep |
|---|---|---|---|---|
| Company X | | | | |
| Company Y | | | | |
| Company Z | | | | |
| Total | | | | |

In this case the kinds of animals slaughtered are listed horizontally across the top. These headings are known as the caption and the vertical lists of data beneath the several headings are referred to as columns. The names of the packing companies appear down the left side of the table. Any such vertical listing is termed the stub of a table and the horizontal lists of data following the several items are called rows. In order to determine the identity of a figure in any one cell of the table it would be necessary to follow the column up to the caption and the row across to the stub on the left.
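
A two-way table of this form can be assembled from individual records. The Python sketch below is an illustration under stated assumptions: Figure 21 is a blank form, so the counts (taken as thousands of head) are invented purely to fill it, and the names `cross_classify` and `render` are hypothetical. Every cell is identified by one class from the stub and one class from the caption.

```python
# Hypothetical records: (packer, kind of animal, thousands of head slaughtered).
# These figures are invented for illustration; Figure 21 supplies no data.
records = [
    ("Company X", "Cattle", 120), ("Company X", "Hogs", 300),
    ("Company Y", "Cattle", 80), ("Company Y", "Calves", 40),
    ("Company Z", "Hogs", 150), ("Company Z", "Sheep", 60),
]
CAPTION = ["Cattle", "Calves", "Hogs", "Sheep"]   # column headings
STUB = ["Company X", "Company Y", "Company Z"]    # row headings


def cross_classify(records, stub, caption):
    """Place every record in the cell fixed by its stub class and caption class."""
    cells = {row: {col: 0 for col in caption} for row in stub}
    for row, col, number in records:
        cells[row][col] += number
    # The last row holds column totals, each covering all rows above it.
    cells["Total"] = {col: sum(cells[row][col] for row in stub) for col in caption}
    return cells


def render(cells, caption):
    """Lay the cells out with the caption across the top and the stub at the left."""
    lines = [f"{'Packers':<10}" + "".join(f"{col:>8}" for col in caption)]
    for row, values in cells.items():
        lines.append(f"{row:<10}" + "".join(f"{values[col]:>8}" for col in caption))
    return "\n".join(lines)


table = cross_classify(records, STUB, CAPTION)
print(render(table, CAPTION))
```

Reading a figure back, for example `table["Company X"]["Hogs"]`, mirrors the rule in the text: follow the column up to the caption and the row across to the stub.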

When three or more orders of classification are desired the problem becomes more difficult since a two-dimensional sheet of paper must serve as the medium for a three- or four-dimensional relationship. The only possible solution is to subdivide identically each class of one or both of the first two classifications. Figure 22 illustrates a "three-way" table, in which each class of animal has been subdivided to show two types of inspecting agency and the total of each class. Finally, either these same classes may be again subdivided or each of the classes in the stub may be subdivided to take care of a fourth classification, resulting in a "four-way" table. Figure 23 in which grade of meat has been added, and Figure 24 adding a time classification, illustrate respectively the method of combining four and five orders of classification in a single table. Additional classifications could be introduced if there were others pertinent to the data.

FIGURE 22
FORM FOR THREE-WAY CROSS-CLASSIFICATION
NUMBER OF CATTLE, CALVES, HOGS AND SHEEP SLAUGHTERED UNDER FEDERAL AND CITY INSPECTION BY THREE MEAT PACKERS IN 1937

Caption heading: Animals Slaughtered and Inspection Agency

| Packers | Cattle: Federal | Cattle: City | Cattle: Total | Calves: Federal | Calves: City | Calves: Total | Hogs: Federal | Hogs: City | Hogs: Total | Sheep: Federal | Sheep: City | Sheep: Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Company X | | | | | | | | | | | | |
| Company Y | | | | | | | | | | | | |
| Company Z | | | | | | | | | | | | |
| Total | | | | | | | | | | | | |

[FIGURE 23, the form for a four-way cross-classification adding grade of meat, and FIGURE 24, the form for a five-way cross-classification adding time, are not legible in this copy.]

Inspection of Figures 23 and 24 shows, however, that they contain too much information to be comprehended easily.
Whenever the further subdivision of data leads to tables which are too complex to be read easily, it is preferable to increase the number of tables. Do not spend time devising ways of presenting multiple classifications in a single table; make two or more tables instead. A fairly good, though not universal, rule is to confine a table to three classifications if it is being made for publication.

From the point of view of construction the addition of a set of percentage relationships to the original data increases the complexity but does not add another order of classification. Thus if the percentage of animals slaughtered by each packer were required in Figure 21, it would be necessary to introduce the subheadings "Number" and "Per Cent Distribution" under each type of animal in the caption. The table would then have the appearance of a three-way table although it would contain only a two-way classification.

Clarity

The reader's ability to grasp the content and significance of a table depends primarily upon the clarity of wording in every part of the table. Careful attention must be given to the phraseology of the title and all headings, and to the inclusion of any necessary notes of explanation and reference.

Title. The first essential is a title which in the simplest form will tell what is in the table. If several lines are required to describe the contents, a brief title explaining the major characteristics can be used with a subtitle in smaller print giving the more detailed description. The title should clearly name the unit or units in which the data are being presented, including all the limitations on the data. These limitations usually include time, space, and exact specifications of the units employed. If a presentation of some derived relationship is the main purpose of the table, that relationship should be stated with equal precision.
The methods of classification used should also be specifically indicated, especially in studies of limited scope when it is only this latter qualification that distinguishes one table from another. In such cases it is not necessary to repeat in each table the general description common to all of them. Stated briefly, if the title definitely answers the questions what, where, when, and how, it is probably adequate.

No better guide can be found for the correct and specific wording of titles than the usage in the United States Bureau of Census publications. Such titles as "Prime Movers, Motors and Generators, by Number and Rated Capacity, for Establishments Classified According to Number of Wage Earners Employed: 1929" or "Percentage of Homes Owned and Rented, by Color and Nativity of Head of Family, for the United States: 1930, 1920, 1900 and 1890" illustrate how the various phrases needed in a complicated title may be best arranged to give a clear idea of the content of the table.

Headings. Every part of the table requires a heading. This includes general headings for caption and stub, for each order of classification, and for each of the separate classes. Clarity and brevity are the chief considerations in wording these headings. They must be complete enough to serve as accurate guides to the data, although lack of space usually requires that they be worded as briefly as possible. There should be no unnecessary repetition of information that has already been given in the title. Each main heading should include whatever detail applies to all of its subheadings so that the latter, which are the most crowded of all, may be very short. The mechanics of arranging these headings contributes to the effectiveness of the table and will be discussed later in the chapter.

In referring to the data in a table it is convenient to be able to designate the columns by number. In Figure 24 the numbers are placed at the extreme top of the column headings.
As an alternative the numbers can be placed just above the line separating the headings from the data, as shown in Figure 23. Sometimes the horizontal rows of a table are also numbered, but this practice is less common. The numbering of columns or rows is not a requirement but merely a convenience to be used whenever it will facilitate the description of the table in the text or reference to it in subsequent tables. The stub itself is ordinarily not numbered as a column.

Another common practice that aids in reading a table is the insertion of the unit directly over the columns to which it refers, as illustrated in Table 19. Thus the data of certain columns are clearly designated as thousands of cars, thousands (of passengers), thousands of dollars, and thousands of tons, respectively. In the index number column "Monthly average, 1923-25 = 100" is inserted in a similar position. When the headings are in the stub the units are stated along with each item or in a separate column, as in Table 51, page 289.

Footnotes. Anything in a table which cannot be understood by the reader from the title and headings should be explained in one or more footnotes. These footnotes should contain statements concerning figures that are missing, preliminary or revised, and explanations concerning any unusual figures or other features of the table that are not self-explanatory. A study of tables appearing in print will provide multiple illustrations of the use of footnotes. Table 27, page 222, is an excellent illustration.

References. A table should always give exact reference to the source or sources from which the data were taken. Three advantages grow out of such citations: (1) The reader is given a sound basis for evaluating the data. (2) Readers who wish to obtain other data similar to those appearing in the table are able to do so.
(3) The author of the table insures himself against the inconsistency of source books. For example, the data for wheat production will depend entirely upon which issue of Agricultural Statistics one happens to use. If a table contains data that have not been published previously, this fact should be stated in a note including the name of the collecting agency. In general, the use of exact references is a method of guarding against the charge of inaccuracy.

Arrangement

The arrangement of a table on the page, the arrangement of data in the table, and the choice of ruling, spacing, and type face all contribute to its effectiveness.

Fitting the Table to the Page. The limitations of the size of the printed page determine the form of tables to a large extent, hence the real problem of arrangement is to fit the table to the page so that it will be effective in that setting. One of the most important features is symmetry with respect to the margins and binding of the page. The table should be planned to read from left to right. Tables which must be read from the side of the page, tables which cover two facing pages, and tables which must be unfolded either sideways or vertically are occasionally necessary, but they should be used only when no combination of smaller tables will serve as well.

The proportions of the page may also determine which headings to use as stub and which as caption. In order that the height of a table may exceed the width, a one-way table is usually arranged vertically and in a cross-classification the longer list of items will ordinarily appear in the stub. Length of wording of the headings is another factor to consider. It is better to use the longer wording in the stub if possible, since too many words crowded into a narrow column heading are very hard to read.
When additional classifications are introduced into either the caption or the stub, the new headings may be arranged in subordinate positions as in Figures 22, 23 and 24, or the table may be rearranged to make the former classification subordinate to the new one. In all such cases the question will arise of arranging columns and rows so as to emphasize significant relationships. The chief consideration in Figure 24 is comparison between 1936 and 1937, consequently the time classification has been arranged in pairs of adjacent columns. If this were not the case, time could be made a main classification, leaving the types of inspection in adjacent columns, as follows:

                        CATTLE
           1936                       1937
  Federal  City  Total       Federal  City  Total

Order of Items. The three types of classification according to time, space, and attributes have been discussed. A classification in any of these categories frequently results in a large number of subdivisions or classes, which necessitates the introduction of a definite order of arrangement.

Time classifications follow the natural order of the events represented. It is only when the major emphasis falls on the most recent events that a reverse time order is used.

Spatial classifications sometimes follow the order of geographical proximity, as when the main subdivisions of the United States are given from northeast to southwest. The New England States are frequently named in the order familiar to everyone, Maine, New Hampshire, Vermont, Massachusetts, Rhode Island, and Connecticut, but ordinarily a list of any length is most usable if arranged in alphabetical order. Size or importance is also occasionally the basis for spatial classification. The method used must be determined according to the nature of the data and the reader's familiarity with the criterion selected.

Attributes of quality lead to great variety of tabular arrangement.
These may be listed according to importance or in some other order familiar to the reader, but again for comparability and ease in locating any given item the alphabetical arrangement is preferable.

An example of time arrangement in natural order appears in the stub of Table 19, page 156, and in the same table the classification of types of freight car loadings is alphabetical. It should be noted that an item such as "miscellaneous" or "other" is placed at the end of the list in any order of arrangement.

Table 20 shows three examples of other orders that may be followed. In Table 20-A the data in a spatial classification are listed according to the size of the city in which they occur. Tables 20-B and 20-C show two possible arrangements of one set of data in an attribute classification, one being in order of size and the other an arbitrary arrangement. In the former it is the size of the per cents, i.e., the data that are being tabulated, that is used as a basis for the arrangement of the classes, whereas in 20-A the determining size was that of the geographical classes themselves rather than the amount of tax receipts. The order of 20-C is a combination of form of organization, importance, and respectability. That is, commercial banks, industrial banks, and personal finance companies are privately owned corporations operated for profit. The next four are co-operative plans usually conducted on a small scale. The pawnbrokers and unlicensed lenders are the most important members of the group but are not on the same plane of respectability as the others because of the questionable business methods sometimes employed.

Ruling, Spacing, and Type Face. These are devices for increasing the effectiveness of a table by concentrating emphasis on important entries and by relieving the monotonous appearance of figures in rows and columns. Whenever rulings aid the reader in understanding the classifications and subclassifications of a table they should be used.
Double and triple rulings are not necessary since equal effectiveness can be achieved by using a single heavier line to separate major divisions in a table. It will be observed that in Figures 23 and 24, pages 158 and 159, every column is separated from the next by rulings, but that only the main groups of rows are so separated.

TABLE 20
EXAMPLES OF TABLE ARRANGEMENTS

A. SPATIAL DISTRIBUTION ACCORDING TO SIZE OF CHARACTERISTIC
TOTAL TAX RECEIPTS OF LARGE CITIES, 1935*

CITY NUMBER IN                        TAX RECEIPTS
ORDER OF SIZE     CITY                (000,000 omitted)
      1           New York                 $586
      2           Chicago                   209
      3           Philadelphia               90
      4           Detroit                    82
      5           Los Angeles                57
      6           Cleveland                  42
      7           St. Louis                  31
      8           Baltimore                  34
      9           Boston                     65
     10           Pittsburgh                 42
     11           San Francisco              31
     12           Milwaukee                  34
     13           Buffalo                    33
     14           Washington                 29
     15           Minneapolis                23
     16           New Orleans                19
     17           Cincinnati                 19
     18           Newark                     32
     19           Kansas City                18
     20           Seattle                    15
                  etc.

B. ATTRIBUTE DISTRIBUTION ACCORDING TO SIZE OF DATA
PERCENTAGE DISTRIBUTION OF SMALL LOAN BUSINESS DONE BY VARIOUS LENDING AGENCIES†

                                   PERCENTAGE OF SMALL
TYPE OF LENDING AGENCY             LOAN BUSINESS
Unlicensed lenders                       28.9
Pawnbrokers                              23.2
Personal finance companies               19.3
Industrial banks                         13.9
Commercial banks                          7.3
Credit unions                             2.4
Remedial loan societies                   2.3
Axias                                     1.9
Employers' plans                           .8
  Total                                 100.0

C. ATTRIBUTE DISTRIBUTION ACCORDING TO ARBITRARY ARRANGEMENT
PERCENTAGE DISTRIBUTION OF SMALL LOAN BUSINESS DONE BY VARIOUS LENDING AGENCIES†

                                   PERCENTAGE OF SMALL
TYPE OF LENDING AGENCY             LOAN BUSINESS
Commercial banks                          7.3
Industrial banks                         13.9
Personal finance companies               19.3
Employers' plans                           .8
Credit unions                             2.4
Remedial loan societies                   2.3
Axias                                     1.9
Pawnbrokers                              23.2
Unlicensed lenders                       28.9
  Total                                 100.0

* Statistical Abstract, U. S. Department of Commerce, 1937, p. 218.
† Evans Clark, Financing the Consumer (New York: Harper & Bros., 1933), p. 30.

In many printed
tables there are no horizontal rulings but the separation between the rows and groups is accomplished by appropriate spacing and by successive indentations of items to indicate various degrees of subclassification. Bold face type, larger type, and italics are frequently used to set off totals or percentages from the other data or to emphasize important items. The use of these devices is well illustrated in the tables presented in the first section of each monthly issue of the Survey of Current Business.

Totals

What Totals To Include. A table is not complete unless it includes whatever totals and subtotals are required to summarize the data presented, but this does not mean that every row and column must be totaled. A total implies that the same unit is used in all of the classes added and that the several classes taken together form a homogeneous whole.

This principle can be explained by reference to Figure 24. The four kinds of animals slaughtered, cattle, calves, hogs, and sheep, have not been added together because a total would imply, for example, that a beef carcass means the same thing as a hog carcass in terms of meat and meat products. The time classification has not been totaled, since a total made up of two years' observations would be purely arbitrary. The emphasis in a time comparison such as this is usually on the relation of the several different periods to each other. However, in case the entire period covered in a table represents a genuine total, such as a year's production resulting from the sum of the production for 12 months, a total for the year would have significance. On the other hand a total for each kind of animal by inspecting agency is included because the classification represents the separation of a whole into comparable parts and because the total is a production figure of value for comparisons in the other classifications.
Likewise three sets of totals have been computed vertically: (1) the subtotal for each company, (2) the subtotal of each grade of meat, and (3) the total of both grades of meat for all three companies. It will be noted that the selection of subclassification in the stub of Figure 24 has resulted in bringing together in the last section the subtotals of each grade of meat, whereas the subtotals for each company are separated from each other. Comparisons of the latter can easily be made since the company subtotals and the total of all companies are printed in italics.

The preceding examples dealt entirely with classifications in which the units were counted objects and included totals wherever pertinent. In classifications of counted objects which are not parts of a total, sometimes an adjusted total may be used, but in general no total should be included. Distinct from these are classifications of rates or prices, as in Table 15, page 149, in which averages should take the place of totals whenever a summary figure is required to make the table complete. A derived table may include totals, averages, or neither, according to the nature of the data and the purpose for which it is constructed.

Position of Totals. The natural sequence of reading is properly accommodated by placing totals at the foot of the columns and at the right of the rows. Statistical practice may reverse this, placing totals at the top and left when they are more important than the individual items. Both positions are in common use. This question must be decided for each table in terms of what the maker believes will best serve his purpose. Briefly, the usage is: totals at the top and left or at the bottom and right; but not at the top and right, nor at the bottom and left. Subtotals follow the same usage in any given table, that is, if the totals are at the top and left of the table the subtotals will also appear at the top and left of the items from which they are computed.
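The placement rule can be sketched in Python. This is an illustration only: the function name `with_column_totals` and the counts are hypothetical and do not come from the book. Only the columns are totaled, since each column holds one kind of animal in one unit; no total is taken across the different kinds of animals, in keeping with the principle stated for Figure 24.

```python
def with_column_totals(stub, rows, position="bottom", label="Total"):
    """Add one total row that sums each column over all stub items.

    position="bottom" puts the total at the foot of the columns;
    position="top" puts it above the items it summarizes.
    """
    if position not in ("bottom", "top"):
        raise ValueError("position must be 'bottom' or 'top'")
    totals = [sum(column) for column in zip(*rows)]
    body = list(zip(stub, rows))
    total_row = (label, totals)
    return body + [total_row] if position == "bottom" else [total_row] + body

# Hypothetical counts in thousands of head, NOT taken from the book.
packers = ["Company X", "Company Y", "Company Z"]
counts = [
    [120, 45, 310, 80],   # Cattle, Calves, Hogs, Sheep
    [95, 30, 270, 60],
    [60, 25, 150, 40],
]

table = with_column_totals(packers, counts, position="bottom")
for name, values in table:
    print(f"{name:<10}", *(f"{v:>6}" for v in values))
```

Calling the function with `position="top"` gives the reversed arrangement, in which the total precedes the individual items.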
Significant Figures

There are two aspects of the subject of significant figures in a table. The first relates to the number of significant figures which need to be retained for accuracy, while the second relates to a total and its parts.

Number of Figures Retained. Tables are often unnecessarily encumbered by retaining all of the digits in numbers running to millions or even billions. A good example of this is Table 18, page 154, in which too many digits have been retained purposely for discussion at this point. It was stated in chapter II² that four significant figures are adequate for statistical work. The unit for telephones could therefore be changed to 10,000 telephones, but customarily units are used in thousands, millions, or billions. In accordance with this practice the number of telephones should be expressed in thousands, making the figures accurate to five digits. The column showing the number of toll calls should be expressed in units of one million calls accurate to one decimal place, giving four significant figures. The entire table has been reproduced in corrected form as Table 21. The revised form is more effective and sufficiently accurate for statistical purposes.
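The reduction to a larger unit can be sketched in Python. The sketch is an illustration only: the function names are hypothetical, and the two unrounded counts are invented for the example (Table 18 itself is not reproduced here).

```python
from decimal import Decimal, ROUND_HALF_UP

def in_units(value, unit, decimals=0):
    """Express an exact count in a larger unit, rounded half up."""
    step = Decimal(1).scaleb(-decimals)          # 1, 0.1, 0.01, ...
    return (Decimal(value) / Decimal(unit)).quantize(step, rounding=ROUND_HALF_UP)

def significant_figures(number):
    """Count the digits of a Decimal as written, ignoring leading zeros."""
    digits = number.as_tuple().digits
    return len(digits) if any(digits) else 0

# Hypothetical unrounded counts, NOT the actual entries of Table 18.
telephones = 15_389_994
toll_calls = 985_462_318

phones_in_thousands = in_units(telephones, 1_000)           # "(000 omitted)"
calls_in_millions = in_units(toll_calls, 1_000_000, 1)      # "(000,000 omitted)", one decimal

print(phones_in_thousands, significant_figures(phones_in_thousands))
print(calls_in_millions, significant_figures(calls_in_millions))
```

With these assumed inputs the telephones come out in thousands with five digits and the toll calls in millions with four significant figures, which is the pattern the text recommends.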
The general rule is: retain only four or at most six significant figures in the tabular presentation of data, but in all cases indicate the size of the unit used either by showing the number of digits omitted from each number or by using the expressions "in thousands" or "in millions," etc. Note that in column 3 "(000,000 omitted)" refers to the number of digits dropped between the original decimal place and the newly established one, regardless of the fact that the data as written are accurate to hundred thousands.

TABLE 21
REVISED FORM OF TABLE 18

         NUMBER OF        NUMBER OF MESSAGES           OPERATING INCOME
         TELEPHONES       (000,000 omitted)            (000,000 omitted)
         (000 omitted)    LOCAL     TOLL AND LONG      LOCAL      TOLL       TOTAL
                                    DISTANCE           SERVICE    SERVICE
YEAR     (1)              (2)       (3)                (4)        (5)        (6)
1931     15,390           22,705    985.5              $723.9     $326.3     $1,050.2
1932     13,793           21,526    823.9               670.7      263.1        933.9
1933     13,163           20,148    747.2               617.3      243.9        861.2
1934     13,378           20,677    781.8               607.7      258.7        866.4
1935     13,845           21,465    830.7               641.0      273.5        914.5
1936     14,454           22,870    911.3               665.2      306.2        971.4

Rounding Off Totals. A different question arises when the table consists of a total and its parts. The entire table should be made up from the original data and each item then rounded off separately, as the data in Table 21 were rounded off from Table 18. As a result the sum of the individual items as they appear in the rounded-off table may be either greater or less than the rounded-off total of the original data. One such instance may be seen in Table 21. For 1932 the total operating income, column 6, does not correspond exactly to the sum of its parts in columns 4 and 5 (670.7 + 263.1 = 933.8, against a printed total of 933.9), although reference to Table 18 will show that no error has been made either in the addition or in rounding off any of the three figures.

The percentage distribution is a particular case of the part to total relation which requires further explanation.
For example, in Table 22 the percentage distribution of the interest-bearing debt should add to 100 per cent whether it is carried to one decimal place or two, because the total debt is being distributed and the sum of the parts must be equal to the whole. However, the exact sum of column 2 is 99.99 and of column 3 is 100.1. The discrepancy arises from rounding off the last decimal place in the computations. No theoretical question is involved but merely the practical one of the best method of expressing the total in the table. Since the total is exactly 100 per cent, in any such case of apparent discrepancy it should be written to one less significant figure than appears in the individual per cents. Thus the total of the first percentage distribution has been written 100.0 and the second 100.

TABLE 22
THE INTEREST-BEARING DEBT OF THE UNITED STATES TREASURY ON APRIL 30, 1937, ACCORDING TO TYPE OF OBLIGATION, AMOUNTS AND PERCENTAGE DISTRIBUTION*

                                  (1)                  (2)                (3)
                                  AMOUNT               PERCENTAGE         PERCENTAGE
                                  OUTSTANDING          DISTRIBUTION       DISTRIBUTION
TYPE OF OBLIGATION                (000,000 omitted)    (two decimals)     (one decimal)
General bonds                        20,133.7             58.70              58.7
U. S. savings bonds                     755.5              2.20               2.2
Adjusted service bonds                  409.6              1.19               1.2
Treasury notes                       10,377.4             30.26              30.3
Certificates of indebtedness            268.7               .78                .8
Treasury bills                        2,353.2              6.86               6.9
  Total                              34,298.1            100.0              100.

* Statement of the Public Debt of the United States, April 30, 1937, Treasury Department.

Checking

Any possible errors in a table should be guarded against by complete checking. There may be errors in the numerical content due to mistakes in addition or other computations, or the validity of the table as a whole may be impaired by errors in judgment with regard to the general plan, items included, or details of wording and arrangement.

Accuracy of Numerical Content. Checking of computations should be a routine step in table construction.
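As one illustration of such a routine check, the percentage distribution of Table 22 can be recomputed from the amounts in column 1. The following is a minimal sketch in Python; the function names are hypothetical, and the amounts are the Table 22 entries as reconstructed here (they add to the printed total of 34,298.1).

```python
from decimal import Decimal, ROUND_HALF_UP

# Amounts outstanding in millions of dollars, column 1 of Table 22.
amounts = {
    "General bonds": Decimal("20133.7"),
    "U. S. savings bonds": Decimal("755.5"),
    "Adjusted service bonds": Decimal("409.6"),
    "Treasury notes": Decimal("10377.4"),
    "Certificates of indebtedness": Decimal("268.7"),
    "Treasury bills": Decimal("2353.2"),
}

def percentage_distribution(parts, decimals):
    """Round each part's share of the total separately to `decimals` places."""
    total = sum(parts.values())
    step = Decimal(1).scaleb(-decimals)
    return {
        name: (amount * 100 / total).quantize(step, rounding=ROUND_HALF_UP)
        for name, amount in parts.items()
    }

def written_total(decimals):
    """Write the total with one decimal place fewer than the parts."""
    return Decimal(100).quantize(Decimal(1).scaleb(-(decimals - 1)))

for decimals in (2, 1):
    shares = percentage_distribution(amounts, decimals)
    print(decimals, "decimals: parts add to", sum(shares.values()),
          "; total written as", written_total(decimals))
```

The separately rounded parts add to 99.99 and 100.1, the sums quoted in the text, while the totals are written 100.0 and 100.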
Whenever any part is totaled the addition must be checked, preferably on a separate outline form. If there are horizontal totals and vertical totals as in Table 17, page 153, the grand total (25,209) that is common to the two sets of totals must be checked in both directions. If the form is similar to Figure 24, page 159, all parts of each subtotal must be checked as well as the grand total. It cannot safely be assumed that if the totals of a table check in one direction or if both sets of subtotals check with the grand total there are no mistakes in the figures. Similar precautions must be taken in the case of every computation in every step of table preparation, especially apparently simple operations that are done mentally such as multiplying by 2 or dividing by 25. Each step should be checked before the next one is taken and corrections of errors should be rechecked.

Validity. Errors of judgment may occur in the planning of tables or during their construction. Checking by an experienced person who has an adequate knowledge of the background of the subject as well as experience in table construction is required if such errors are to be avoided. The types of errors that could have been corrected by careful checking are illustrated in Table 23.

TABLE 23
LIVE-STOCK EXPORTATION, 1929-32
(1,000 head)

YEAR                                   1928              1950              1931              1932
                                            PER CENT          PER CENT          PER CENT          PER CENT
Hogs, live                          1,279      88       654      52       355      22       179      13
Hogs, butchered                        47       3       117       9       191      11        16       1
Bacon                                 116       8       410      32       920      56     1,008      74
Ham and other products, pieces          8       1        76       7       179      11       160      12
  Total                             1,450     100     1,257     100     1,645     100     1,363     100

The title fails to name the country exporting. It indicates all kinds of livestock but the only live animal included in the table is hogs. Butchered hogs can scarcely be called livestock and certainly bacon and hams are not. The title names the years 1929-32 which might mean either annually or a total for the period.
However, the headings do not agree with either, even disregarding the obvious typographical error in printing 1950. Since there are at least three different units used in the table (head of hogs, pieces of ham, and bacon in some unnamed unit), the unit "1,000 head" should not have been stated along with the title. It should have read "All figures in thousands." The caption heading "Year" is misplaced and the stub has no heading. The caption subheadings should read "Number" and "Percentage Distribution." However, the most fundamental error is in the presentation of totals and percentage distributions for items that are expressed in non-comparable units and that in no sense make up a total having any meaning.

TABULAR FORMS

In the conduct of routine statistical work the tabulation of data becomes a continuous process of recording the same types of data daily, weekly, monthly, or annually as the case may be. For this purpose the preparation of standard forms in quantity saves time and promotes uniformity of records. These forms must be carefully planned and drawn up, hence they are excellent illustrations of the use of the principles of tabulation. Frequently special adaptations must be made to facilitate the recording of particular types of data, a circumstance which increases the desirability of studying such forms. On succeeding pages selected forms used by government agencies and by a business concern are presented with brief discussion.

Government Forms

Three forms used in the price section of the crop reporting service of the Department of Agriculture are presented as examples of the blanks prepared for recording external data.

Figure 25 is used principally for summarizing the data on prices received by farmers by months and by commodities. The printed headings indicate the following: Column 1, weights for computing weighted average prices for the United States; column 2, straight or unweighted average prices reported by correspondents in each State; column 3, average prices reported by correspondents as weighted by price reporting districts; column 4, price recommended for the commodity by the State Statistician; column 5, price adopted by the Crop Reporting Board after review of all the available data, including the original price listing sheets, market price reports and other check data. Column 6, headed "Extension," provides space for recording extensions of price times weight computed in averaging the adopted State prices to obtain United States and division averages. Since there are thirteen columns on this sheet it is possible to include the record for two months on one sheet. This forms our primary record of monthly prices.

[FIGURE 25. DEPARTMENT OF AGRICULTURE FORM C. E. 1-128, "PRICES RECEIVED BY FARMERS": blank form with spaces for commodity and date, a stub listing the States, the geographic divisions and the United States, and the six columns described above repeated for two months.]

FIGURE 26
DEPARTMENT OF AGRICULTURE LONG-TERM BLANK

(COMMODITY): AVERAGE PRICE PER (UNIT) (RECEIVED OR PAID) BY FARMERS, (STATE), (FROM) 19__ (TO) 19__

YEAR   JAN.   FEB.   MAR.   APR.   MAY   JUNE   JULY   AUG.   SEPT.   OCT.   NOV.   DEC.   AV.

Source: Bureau of Agricultural Economics, Crop Reporting Board.

Figure 26 is designed to make possible the summarization of monthly prices for a number of years on one sheet. One or more of these sheets carries the complete record of monthly prices for each commodity, by States, including weighted annual averages.
It is our practice to bind these sheets for all States together in books, one for each commodity.

Figure 27 is provided for summarizing the monthly prices in another way, with prices for all commodities for one State listed, three years to one sheet. These forms are also bound together in books in which all States, geographic divisions and the United States are represented.³

[FIGURE 27. DEPARTMENT OF AGRICULTURE FORM: blank form, printed sideways in the original, for listing the monthly prices of all commodities for one State.]

These forms are so planned that they can be used for several different kinds of tabulations. For example, Figure 27 is also used to record the individual crop reports from which the state averages of column 2 of Figure 25 are computed.

Business Forms

The routine tabulation by business concerns of data concerning their own operations differs from the record keeping of government agencies mainly in the type of data tabulated. This difference leads to the use of forms which are quite distinct from those previously presented, and which are specialized to meet the needs of the particular concern using them.
It follows that as many forms could be presented as there are business concerns, but those used by a single concern will illustrate some important uses and adaptations.

The forms shown on succeeding pages are used in the routine statistical work of the Eastman Kodak Company of Rochester, New York.⁴ This company divides the year into 13 four-week periods, hence all of the forms which follow are filled out 13 times each year.

Figure 28. The "Comparison of Sales by Divisions" is a form prepared primarily for the use of executives. It is the usual type of two-way table intended to be read in either direction, i.e., the percentage of increase or decrease of sales of any of the products can be compared from one division to another, or the percentage of increase or decrease of sales of different products in any division can be compared. The percentage of increase or decrease of sales by divisions in next to the last column is not obtained from the preceding columns but by comparing sales in dollars with the corresponding figure for the preceding year. These percentages can be compared with the percentage of change in bank debits given in the last column.⁵

³ The explanation of the use of these forms was supplied by Mr. Roger F. Hale, Agricultural Statistician, Division of Crop and Livestock Estimates, Bureau of Agricultural Economics, United States Department of Agriculture, Washington, D. C.
⁴ They are presented with permission of the Eastman Kodak Company and are made available through the courtesy of Mr. A. H. Robinson, Assistant-Treasurer.
⁵ The use of bank debits as an indicator of business activity is discussed in chapter XX.

The preparation of the percentage of change of bank debits requires considerable work because the divisions of the country used by Eastman Kodak
Company as shown on the map do not coincide with Federal Reserve Districts for which the percentages of change in bank debits are available in print. Features of this form are the ruling, the use of black and red ink to distinguish increases and decreases, and the exclusion from the total column of sales not common to all divisions. For example, if there were some divisions in which "chemicals" were not sold, the sales of this product would be excluded from all divisions in computing the percentage of change of total sales.

FIGURE 28
COMPARISON OF SALES BY DIVISIONS
PERCENTAGE CHANGE FROM PREVIOUS YEAR
Black figures are increases, red figures are decreases

            AMATEUR   AMATEUR   CK         PROF FILM                        XRAY &                BANK
DIVISION    CAMERAS   FILM      PRODUCTS   & PLATES    PAPERS   CHEMICALS   DENT FILM   TOTAL*    DEBITS
A
B
C
D
E
F
G
TOTAL US

* Excludes sales not common to all divisions

Figure 29. The line above the double ruling might read "Report for Fifth Period Ending May 20, 1939." The record of total employment is placed at the top of the sheet but in the remainder of the form the staff is divided into "General Employees" and "No Lost or Overtime Employees." The reasons for lost time at the left give a detailed view of what caused employees to be absent from work and the length of absence. A summary of actual hours worked, hourly rate of earnings, and payroll is included for each of the two types of employment. The features of this form are the spacing, variation in type, judicious use of ruling, and the inclusion of a transfer code.

[FIGURE 29: lost time form, printed sideways in the original. Legible entries include "Grand Total Lost Time," "Total Lost Time Not Paid," "Total Lost Time Paid For," and "Total No Lost or Overtime (Average for Period)."]

Figure 30. A person unfamiliar with the organization and operations of the Eastman Kodak Company would experience considerable difficulty in filling out the form of Figure 29 because of the technical use of terms and the need for explanation of the method of computing averages and other summary figures. But variations in usage would also occur among those familiar with the form, if uniform interpretations of terms and computations were not provided, hence the instructions for filling out the form are printed on the back. These have been reproduced as Figure 30.

FIGURE 30
INSTRUCTIONS ON THE REVERSE SIDE OF LOST TIME [...]

Reportable Employes
  In "Summary of Total Employes" the figures on the total for Manag[...] Departmental Superintendents and Assistant Superintendents, and Main [...] be indicated in a separate item.
  The "Average for Period" figures are averages of the four figures for the [...] week of the period.
  General. Under the section on "General Employes" are to be reported [...] or Overtime" basis. The total "General Employes" figures should be [...]
  No Lost or Overtime. Under the section on "No Lost or Overtime Em[...] Employes except "General Employes." Exclude the Manager, General [...] Superintendents and Assistant Superintendents, and Main Office Depar[...] or Overtime Employes" figures should be the average for the period.
Absences of one hour or more are to be reported. Absences for any cause exce[...] period exceeding 26 weeks for any one employe.
Absences with Permission (Code No. 1)
  [...] absences with permission other than for slack work, illness, accidents, vac[...] excess over 40 hours should be indicated in one item for both "General Em[...] Employes."
Absences without Permission (Code No. 2)
  Include all time lost without permission and when no explanation is given.
Slack Work absence (Code No. 3) should be indicated for both "General" and [...]
Reportable absences on account of illness (Code No. 4) are to be reported [...] both "General" and "No Lost or Overtime Employes").
  8 hours or less.
  More than 8 hours, but not exceeding 40 hours.
  More than 40 hours, but not exceeding 26 weeks.
Absences on account of accident and injuries (Codes No. 5 and No. 6) should [...]
  Employe has returned to work.
  Employe has left employ of Company.
  It has been decided to be a case of permanent disability.
Vacations Paid For (Code No. 7)
  Under this heading include all time granted for annual vacations which is paid [...] both "General" and "No Lost or Overtime Employes."
Absence for discipline (Code No. 8) should be indicated for both "General" and [...]
Absence for "Excess Time Over 40 Hours" (Code No. 9) should be indicated [...] Overtime Employes."

The explanations are written for the guidance of persons thoroughly familiar with the way the company is organized and operated; therefore much is omitted that would have to be stated in the instructions accompanying a similar form used in external statistical work. For example, the definition of reportable employees appears to be ambiguous in stating that "General Employees" excludes "No Lost or Overtime Employees" while "No Lost or Overtime Employees" excludes "General Employees."
However, the persons who use this form understand that "General Employees" are those who work on a piece rate or hourly rate basis, and "No Lost or Overtime Employees" are those who work on a fixed weekly wage basis. Hence the instructions concerning reportable employees mean that regardless of the type of work performed by a particular individual during a given payroll period he is to be reported according to his permanent status either as a time worker or a salaried worker.

FIGURE 31
LABOR TURNOVER REPORT
[Form. Legible entries: "Number of Employees First Day of Period"; "Entrances"; "Deduct 'Transferred' and 'Unavoidable'".]

Figure 31. The Labor Turnover Report, like the Lost-Time Report, is filled out for each works and each four-week period. The form is largely self-explanatory although it calls for considerable detail. The report requires information on three subjects: (1) the number employed, (2) the number entering, and (3) the number leaving, but the emphasis is placed on an analysis of the exits. The primary summary figure is the "Net Turnover Per Cent" obtained by dividing "Net Number Leaving" by the "Average Number of Employees for the Period." The percentage distribution of "Total Exits" according to length of service provides information concerning the "employee plant age" at which employment is most likely to be terminated. The analysis of reasons for leaving, on the right-hand side of the form, is intended to provide detailed information concerning the underlying causes of labor turnover.
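The arithmetic behind the two summary figures just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the company's procedure: every number is hypothetical, the length-of-service classes are invented, and the rule that "Net Number Leaving" equals total exits less those marked "Transferred" and "Unavoidable" is an assumption read from the deduction line on the form.

```python
from typing import Dict


def net_turnover_percent(net_number_leaving: float, average_employees: float) -> float:
    """Net Turnover Per Cent: net number leaving as a percentage of the
    average number of employees for the period."""
    if average_employees <= 0:
        raise ValueError("average number of employees must be positive")
    return 100.0 * net_number_leaving / average_employees


def percentage_distribution(counts: Dict[str, int]) -> Dict[str, float]:
    """Express each class count as a percentage of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no exits to distribute")
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical figures for one four-week period.
total_exits = 40
transferred = 6   # assumed to be deducted, per the form's "Deduct" line
unavoidable = 4   # assumed to be deducted, per the form's "Deduct" line
net_leaving = total_exits - transferred - unavoidable
average_employees = 1500

rate = net_turnover_percent(net_leaving, average_employees)

# Hypothetical length-of-service classes for the 40 total exits.
exits_by_service = {"under 1 year": 20, "1 to 5 years": 12, "over 5 years": 8}
shares = percentage_distribution(exits_by_service)

print(f"Net turnover: {rate:.1f} per cent")
for label, share in shares.items():
    print(f"{label:>14}: {share:.1f} per cent of total exits")
```

With these made-up figures the net turnover is 2.0 per cent, and half of all exits fall in the shortest service class, which is the kind of concentration the length-of-service distribution is meant to reveal.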
Continuous study of these reports aids management in two ways: (1) the personnel manager secures a background of knowledge of what types of persons are most likely to become permanent employees and (2) over-all management is able to detect unsatisfactory conditions that are causing a large separation ratio in any part of the organization.

6 The use of a percentage distribution of separations in the measurement of labor turnover is developed in chapter XII.

Conclusion

The forms appearing in this section are not intended to be representative of all the prepared tables used in routine statistical work. The purpose is merely to present a few examples to show how the principles of tabulation are employed in practical work. The outstanding feature of all of the forms is the extent to which arrangement, ruling, spacing, and content have been co-ordinated to emphasize the major results contained.

These forms, of course, are not intended for publication but are prepared for use by persons thoroughly familiar with their contents and purposes. Consequently many things which have required explanation in presenting the forms in this book are commonplace to those who use the forms regularly. This difference in background leads to a general observation of some importance to the budding statistician. Study of the principles and methods of statistics provides a sound basis for engaging in work of the type involved in preparing forms such as these examples, but general knowledge must be supplemented by particular training to produce a practicing statistician.

PROBLEMS

1. What type of classification is employed in each of the following:

   COLOR OF HAIR     NUMBER OF STUDENTS IN CLASS
   Light                         8
   Red                           3
   Brown                         7
   Black                         2

   CITY              BANK DEBITS 1938 (Millions of Dollars)
   Boston                   14,288
   New York                168,778
   Philadelphia             14,553
   etc.

   SIZE OF CITY            No. OF DWELLING UNITS CONSTRUCTED
                           PER 10,000 POPULATION IN 1940
   500,000 and over                  48.6
   100,000 to 500,000                56.8
   50,000 to 100,000                 48.5
   25,000 to 50,000                  67.4
   10,000 to 25,000                  68.7
   5,000 to 10,000                   67.3
   2,500 to 5,000                    64.4
   All urban                         57.5

   SHIPMENTS OF FINISHED STEEL BY YEAR
   UNITED STATES STEEL CORPORATION
   (1,000 net tons)
   1937                     14,098
   1938                      7,316
   1939                     11,707
   1940                     14,976

2. From recent issues of any business periodical find three one-way tables, each of which illustrates a classification according to a different kind of characteristic. Copy enough of the table to indicate the kind of classification. Give exact and complete references to the sources used.

3. What are the distinguishing characteristics of primary tables and derivative tables?

4. Which of the tables of problem 1 are primary and which are derivative?

5. In Table 17, page 153, what information is primary and what is derivative?

6. The following statistics have been published for the Bell Telephone System:
   "The number of manual service telephones declined from 10,705,118 at the end of 1930 to 9,659,349 at the end of 1931 while the dial service telephones increased in the same period from 4,976,941 to 5,730,645. This is a net decline in number of telephones of 292,065. The average number of telephone calls per day in 1930 was 62,365,000 local and 2,933,000 toll; in 1931 these had declined to 62,205,000 and 2,700,000 respectively, a total decline of 393,000 calls. The miles of wire in underground cable were 50,225,000 at the end of 1930, the miles of aerial cable were 20,785,000. At the end of 1931 there were 52,214,000 miles of underground cable and 21,951,000 miles of aerial cable. There were 5,238,000 miles of open wire at the end of 1930 and 5,074,000 miles a year later."
   Present this information in tabular form, taking account of all the points of established practice in table construction. Does your table have unity? Why or why not? What is the degree of complexity? Explain.

7. List the separate classifications present in Figure 24, page 159, and state the characteristic with respect to which each classification is arranged.

8. Study the headings of tables in any year's Supplement to the Survey of Current Business. Write a paragraph on the use of table headings based on your findings. Give specific references.

9. a) Consult Table 2, page 1, in any Statistical Abstract from 1932 to 1940.
      (1) Discuss the location of totals.
      (2) Discuss the arrangement of stub items.
   b) In the Statistical Abstract consult either Table 494, page 432, 1939; Table 472, page 408, 1938; Table 464, page 401, 1937; or Table 460, page 401, 1936.
      (1) Describe the method of classification and degree of complexity.
      (2) How many different units are there? Name them. Do you think it is justifiable to include all of them in one table? Why or why not?
      (3) Discuss any desirable or undesirable features in the table.

10. Describe in detail the order of arrangement of each of the following tables:

   A
   DEATHS FROM CHIEF CAUSES IN THE UNITED STATES, 1935
   DISEASE           No. OF DEATHS
   Heart                  312,333
   Cancer                 144,065
   Nephritis              103,516
   Pneumonia              100,279
   Accidents               99,967
   etc.

   B
   BANK CENSUS, 1935
   STATE          No. OF BANKS   No. OF EMPLOYEES   TOTAL WAGES
   Alabama               251            2,123       $ 3,227,296
   Arizona                39              492           848,587
   Arkansas              260            1,416         1,905,105
   California          1,083           19,523        38,675,923
   etc.

   C
   RETAIL TRADE IN THE UNITED STATES, 1935
   APPAREL GROUP
   TYPE OF MERCHANDISE                SALES (in millions)
   Men's furnishings                        $516
   Men's clothing                            144
   Family clothing                           359
   Women's ready-to-wear                     795
   Furriers and fur shops                     60
   Millinery stores                           94
   Custom tailors                             67
   Accessories and other apparel             110
   Shoe stores                               511

   D
   VALUE OF PUBLIC BUILDINGS ERECTED IN CITIES IN NEW YORK STATE IN 1936
   CITY           VALUE OF THE CONSTRUCTION (in thousands)
   Buffalo              $   21
   Rochester             1,491
   Syracuse              1,108
   Yonkers                  60
   Albany                   95
   Utica                    17
   etc.
REFERENCES

Bureau of Agricultural Economics, United States Department of Agriculture, The Preparation of Statistical Tables, A Handbook. Washington, D. C., December, 1937.
   A statement of the rules for table construction employed by a government bureau.

DAY, EDMUND E., "Standardization of the Construction of Statistical Tables," Quarterly Publications of the American Statistical Association, Vol. XVII, No. 129 (March, 1920), pp. 59-66.
   A brief but complete statement of the principles of tabulation.

DAY, EDMUND E., Statistical Analysis. New York: The Macmillan Co., 1925.
   Chapters IV and V contain a detailed discussion of classification of observations.

MUDGETT, BRUCE D., Statistical Tables and Graphs. Boston: Houghton Mifflin Co., 1930.
   Chapters I, II, and III contain a very lucid explanation of the principles of classification and tabulation.

WALKER, HELEN M., and DUROST, WALTER N., Statistical Tables, Their Structure and Use. New York: Bureau of Publications, Teachers College, Columbia University, 1936.
   A detailed discussion of the mechanics of table construction and the analysis of tabular material (examples from field of education).

CHAPTER IX
CLASSIFICATION OF LIBRARY SOURCES

THE MEANING OF COLLECTION FROM LIBRARY SOURCES

Chapters IV to VIII have been devoted to the methods of securing data by direct investigation. Chapters IX and X will explain the procedures used in finding data that have already been collected and published. The discussion is introduced at this point because subsequent chapters deal with the steps of analysis which are applicable to data collected either directly or from library sources.

"Library" is used as a general term descriptive of all published sources of business data.
Some of the publications which are available to students only in public or school libraries may be kept on file currently by an individual business concern, but the industrial statistician is also likely to be dependent upon libraries for long-time series or other than ordinary data. His method of procedure will not differ materially from that of the student in the search for published data needed in a given problem.

Published sources of business data are for the most part current periodicals or yearbooks. How to become familiar with the contents of these publications presents a very real problem. A reference list of such sources and of their contents is of only temporary value since they are subject to constant changes. New publications may appear and older ones disappear; new series are added, older series are discontinued, and the form of recording is altered. Consequently, in this chapter the emphasis is placed on various classifications of library sources, but no attempt is made to provide a complete list¹ of reference material. The next chapter will deal with the difficulties which may be encountered in securing data from these publications.

¹ A selected list of sources is given in Appendix A at the end of the chapter. These sources are numbered consecutively and whenever one of them is mentioned in the text reference is made by number to the detailed description in the appendix.

METHODS OF CLASSIFYING SOURCES

Published sources of data may be classified in a number of different ways according to the point of view from which a problem is approached. Five methods of classification are discussed in the succeeding pages: (1) types of data contained, (2) form of publication, (3) frequency of publication, (4) regularity of publication, and (5) publishing agency. These are not all equally important but all of them must be taken into account in acquiring familiarity with sources.
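For a reader who keeps a working file of sources, the five classifications can be treated as five fields of one record. The sketch below is a hypothetical illustration in Python, not a scheme proposed in the chapter: the class name, field names, and the `find` helper are all invented, and the two sample entries use only characteristics the chapter itself attributes to those publications, with `None` wherever the chapter is silent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    """One published source, described along the five classifications."""
    title: str
    data_types: str            # (1) types of data contained
    form: Optional[str]        # (2) form of publication
    frequency: str             # (3) frequency of publication
    regularity: Optional[str]  # (4) regularity of publication
    agency: str                # (5) publishing agency


# Sample entries; None marks a characteristic the chapter does not state.
SOURCES: List[Source] = [
    Source("Census of Manufactures", "manufacturing", None,
           "biennial", "irregular", "federal government"),
    Source("Commercial and Financial Chronicle", "general", "auxiliary source",
           "weekly", None, "private agency"),
]


def find(sources: List[Source], **criteria: str) -> List[Source]:
    """Return the sources whose fields equal every given criterion."""
    return [s for s in sources
            if all(getattr(s, field) == value for field, value in criteria.items())]


for s in find(SOURCES, agency="federal government"):
    print(s.title)
```

Filtering on `agency` first mirrors the way libraries usually catalogue these publications; the other fields then narrow the search.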
Types of Data Contained

Classification according to type of data is the least important method from a practical point of view, since it is applicable to only a few sources. All source books of business statistics are concerned with economic data, but as a rule they are not confined to a single phase such as production of raw materials, manufacturing, or marketing. Agricultural Statistics,² the Census of Manufactures,³ and the Market Data Handbooks⁴ might be named as representative of these three specific phases of the economic structure. A few other sources of limited scope are: Statistics of Railways⁵ (transportation) and Chain Store Age⁶ (a specialized type of retail trade). By far the greater number of source books deal with several or all of the functions in our economic system. Examples are: the Survey of Current Business,⁷ Standard Trade and Securities Statistical Bulletin,⁸ and the Commercial and Financial Chronicle.⁹

Some of the source books may be classified according to their scope in other respects, for instance, geographically. Many of them contain data for the entire United States; others are confined to a particular state, city, or local area. Still others present world data or data for several different countries. Frequently a single source will cover territorial or governmental subdivisions at various levels. For example, Agricultural Statistics² is mainly devoted to a complete compilation of data concerning agriculture in the United States with some information for other countries. In many tables, however, production statistics are subdivided by states, and price and market data by individual cities or regions. This source then is international, national, state, and local in scope, with the major emphasis on the national data.

² Appendix A, No. 23.
³ Appendix A, No. 12.
⁴ Appendix A, No. 6.
⁵ Appendix A, No. 42.
⁶ Appendix A, No. 72.
⁷ Appendix A, No. 1.
⁸ Appendix A, No. 59.
⁹ Appendix A, No. 54.
It can be concluded, therefore, that the majority of sources cover so wide a range of information that they cannot be classified according to specific types of data. This method of classification of source material, although theoretically sound, is not usable in the search for data.

Form of Publication

Statistical Source Books. Some publications consist almost entirely of statistical tables. Either the index or the table of contents can be used to find the data pertaining to a particular subject. Most publications of this kind come from governmental agencies, although in recent years there has been a great increase in the amount of such work done by private organizations. Examples of the latter are the Standard Trade and Securities Statistical Bulletin,¹⁰ Automobile Facts and Figures,¹¹ and A Review of Railway Operations.¹²

Auxiliary Sources. Sources which contain data as an auxiliary to other functions are more difficult to use. Data may be scattered through the book or magazine in conjunction with articles to which they apply. This is the case with the Commercial and Financial Chronicle,¹³ Business Week,¹⁴ and Commerce Reports.¹⁵ Other auxiliary sources such as Dun's Review¹⁶ and the Northwestern Miller¹⁷ group most of their data in one section.

In the great majority of such publications the data appear in tabular form. The tables have proper titles and whatever footnotes are necessary to explain any irregularities of the information. There are, however, a few cases in which valuable data are printed as text material. Careful attention is necessary in order to detect data published in this form and caution must be exercised in using them because necessary explanations may be far removed from the place in the text at which the data are found. It would be advantageous to all concerned if this practice were discontinued, but as long as it persists statisticians must be prepared to search for information appearing in that form. Newspapers avoid the use of tables with some regularity because of the difficulty of adjusting them to narrow columns. The following paragraph illustrates the need for a table.

   Then, France is a country of handicraftsmen; even after recent and important evolutions, such as the reconstruction of the northern departments and the return of Alsace-Lorraine, it has not decidedly become a country of great industry. Out of 21,721,000 people given to active occupations, only 6,181,000 (28%) belong to the industries of transformation. Among those 6,181,000 only 4,027,000 (65%) are regular industrial workers; 683,000 (11%) are employers, which shows the great number in France of small employers; 1,162,000 (18%) work alone independently or are not regularly connected with employers. Out of 4,000,000 of regular workingmen, only 774,000 (19%) are employed in factories of more than 500 workers! The conclusion is that France is a country of artisans; the village joiner, the motor-car mechanic, the couturiere even in the great maison de couture, the vine grower, the gardener who raises vegetables or fruits, all belong to that type of workers, and they are undoubtedly the most typical of the French people.¹⁸

¹⁰ Appendix A, No. 59.
¹¹ Appendix A, No. 50.
¹² Published annually by the Association of American Railroads, Bureau of Railway Economics, Washington, D. C.
¹³ Appendix A, No. 54.
¹⁴ Appendix A, No. 55.
¹⁵ Published weekly by the United States Department of Commerce, Bureau of Foreign and Domestic Commerce.
¹⁶ Appendix A, No. 57.
¹⁷ Appendix A, No. 66.

Variability of Content. There is a variation in the form of publishing data which applies to sources devoted entirely to statistical information as well as to those that publish such information incidentally.
The most convenient publications to use are those which contain the same series of data in each issue, such as the monthly Survey of Current Business,¹⁹ the Monthly Labor Review,²⁰ and the Standard Trade and Securities Statistical Bulletin.²¹ On the other hand Crops and Markets²² and Steel²³ present whatever monthly or weekly data are available at the time of publication.

Frequency of Publication

In discussing this classification we will proceed from those sources which appear most frequently to those which have longer intervals between publication dates.

Daily. Daily papers then are the first source on the list. The financial section of the paper contains information on a variety of subjects. The great virtue of the daily paper is its ability to place data in the hands of its readers quickly. The element of speed tends to reduce accuracy; hence the data found in daily papers are sometimes not reliable and for this reason they should be verified in other sources as soon as possible.

¹⁸ André Siegfried, "French Industry and Mass Production," Harvard Business Review, Vol. VI, No. 1 (October, 1927), p. 2.
¹⁹ Appendix A, No. 1.
²⁰ Appendix A, No. 16.
²¹ Appendix A, No. 59.
²² Appendix A, No. 24.
²³ Appendix A, No. 62.

There are a number of daily publications which are valuable sources because they deal with particular subjects. Among these the Wall Street Journal,²⁴ the New York Journal of Commerce,²⁵ and the American Metal Market²⁶ are typical. There are also many daily reports issued by government agencies, such as the daily Treasury Statement²⁷ and daily produce market reports issued by state departments of agriculture.

Weekly. There is an increasing tendency toward weekly compilation and publication of data to meet the demands of business men for information as nearly current as possible. The demand is further evidence of the extent to which numerical facts have become useful in determining business policy.
Accordingly, the Bureau of Labor Statistics now computes its Index of Wholesale Prices weekly. Data such as car loadings, bank debits, and electric power production are available weekly. Among weekly publications the Commercial and Financial Chronicle,²⁸ the Weekly Supplement to the Survey of Current Business,²⁹ and Iron Age³⁰ are widely used.

Monthly. Monthly publications also attempt to put information in the hands of users as soon as possible. Sometimes the data for one month are available as early as the 10th of the following month. More commonly the data are one or even two months old before they appear in print. Some important monthly publications are the Federal Reserve Bulletin,³¹ the Survey of Current Business,²⁹ and bank reports such as the Business Bulletin of the Cleveland Trust Company.³²

Annually. Other valuable sources appear annually. Most important of these are the yearbooks which contain a great amount of basic data with some series running back for long periods. Yearbooks require much preparation, consequently the data may be several months or even a year old before the book is published. Among the valuable yearbooks the Statistical Abstract³³ and Agricultural Statistics³⁴ may be mentioned. Several newspapers in various parts of the country publish yearly almanacs which contain a large amount of statistical data. These are not usually considered to be authoritative source books but they serve as convenient guides to data which may be found elsewhere.

²⁴ Appendix A, No. 58.
²⁵ Journal of Commerce Corporation, New York.
²⁶ American Metal Market Company, New York.
²⁷ United States Treasury Department, published in daily newspapers.
²⁸ Appendix A, No. 54.
²⁹ Appendix A, No. 1.
³⁰ Appendix A, No. 61.
³¹ Appendix A, No. 34.
³² Cleveland Trust Company, Cleveland, Ohio.

Longer Intervals. Examples of sources appearing less frequently are the volumes of the Census of Population,³⁵ which are published at ten-year intervals; the Census of Agriculture,³⁶ which has been published along with the Census of Population since 1860 and quinquennially since 1925; and the Census of Manufactures,³⁷ which has been published along with the Census of Population since 1850, was also published in 1905 and 1914, and has appeared biennially since 1919.

Special Releases. Recognizing the necessity of saving time, many of the government bureaus release their more important data as soon as possible. In some instances the data contained in these releases may be preliminary and may subsequently be revised in a regular publication. In other cases the data are not reprinted at any time. The Bureau of Mines has adopted the practice of issuing each chapter of its Yearbook³⁸ separately in advance of the complete bound volume. The Bureau of Labor Statistics distributes special processed bulletins on wholesale and retail prices, cost of living, and employment and payrolls, but the greater part of this material is reproduced in subsequent issues of the Monthly Labor Review. On the other hand the Bureau of the Census releases information concerning the Census of Manufactures in both processed and printed form, much of which is never reprinted in the bound volumes. The Bureau of Public Roads of the Department of Agriculture distributes various printed releases concerning state and federal gasoline taxes and automobile registrations and license fees. Some of this information is not reprinted in bound volumes. Knowledge of these various releases must be acquired by experience since they are not always included in check lists of government publications.

³³ Appendix A, No. 8.
³⁴ Appendix A, No. 23.
³⁵ Appendix A, No. 10.
³⁶ Appendix A, No. 11.
³⁷ Appendix A, No. 12.
³⁸ Appendix A, No. 23.
Regularity of Publication

Some published sources appear at regular intervals; others appear irregularly.

Regular. The question of regularity is important from several points of view. Business men have learned to expect weekly publications to be delivered in a certain mail. They delay decisions until the arrival of the latest statistical report. Absolute regularity of publication is required to meet this demand. Statisticians depend upon regular publications for current data in carrying on their research work. The business community in general expects to receive regular information from weekly or monthly publications. In other cases no exact date of publication is observed but publication is certain each week, or month, or quarter, or year. Regularity is, of course, a great virtue in source material and the great majority of publications possess it.

Irregular. There are other publications which appear irregularly although the data which they contain are collected quite regularly. The Census of Manufactures contains data that are collected biennially but are published whenever the Department of Commerce is able to prepare the data and funds are available to meet the cost.

Only those irregular publications that are already in print can be included in the plan of an investigation. On the other hand regular publications which are scheduled to appear while an investigation is in progress can be included. An example will perhaps clarify this distinction. As this is being written parts of the results of the 1940 Census of Population have been released. It would not be possible, however, to plan an investigation requiring the use of the complete Census or any unpublished part of it because there is no way of knowing when the needed data will be published.

Special Studies. While the most valuable sources are those which are published at regular intervals and which consequently keep the information up-to-date, there are many special studies that appear from time to time containing data which are available in no other sources. These studies are reports of special researches. Usually they are models of the application of statistical methods in practical work. They are consequently valuable references for data, methods of analysis, and form of presentation. Examples of this type of work are the Cost of Living Studies of the Department of Labor made in 1918 and in 1936.³⁹ There are also important non-government publications of this type among which the Retail Clothing Survey made by Northwestern University⁴⁰ and the Study of Income in the United States, 1909-19, by the National Bureau of Economic Research⁴¹ are excellent examples.

³⁹ Appendix A, No. 22.

Publishing Agency

This classification of sources has probably the greatest practical value in library research. The publications are usually catalogued on this basis in the libraries. In discussing this classification it will be well to recall that only those sources that are related to the field of business are included, no attempt being made to include sources of statistical data in other fields.

United States Government. The most important publishing agency is the federal government, chiefly through the executive branch which includes all the departments and independent offices.⁴² An outline of its plan of organization is given in Figure 32. Some of the departments and certain bureaus and offices in particular produce a tremendous amount of statistical data, whereas others by the very nature of their work produce none.
In searching for data, one quickly becomes familiar with the Bureau of Foreign and Domestic Commerce and the Bureau of the Census in the Department of Commerce; the Bureau of Labor Statistics in the Department of Labor; the Bureau of Agricultural Economics in the Department of Agriculture; the Bureau of Internal Revenue in the Department of the Treasury; the Bureau of Mines in the Department of the Interior; the Interstate Commerce Commission; the Federal Trade Commission; and the Board of Governors of the Federal Reserve System.

The list of sources given in Appendix A at the end of the chapter is arranged according to publishing agency and includes the most important statistical publications of these bureaus and other offices. In addition to the regular government publications which are usually issued as periodicals or yearbooks, there are often useful data in the annual reports of department, bureau, and division heads. Several of these are listed in Appendix A. Another is the Annual Report of the Commissioner of Immigration (Department of Labor), which gives data concerning immigrants and emigrants in more detail than can be found in any other publication. Some annual reports are available only as numbered documents of the Congress to which they were submitted, but others are published separately.

⁴⁰ Costs, Merchandising Practices, Advertising and Sales in the Retail Distribution of Clothing, Vols. I-VI, 1921. Selling Expenses and Their Control in the Retail Distribution of Clothing, Vol. VII, 1922. New York: Prentice-Hall, Inc.
⁴¹ Volumes I and II of the Publications of the National Bureau of Economic Research. New York.
⁴² The United States Government Manual, published three times yearly by the Office of Government Reports, Washington, D. C., gives complete and up-to-date information on the organization and activity of all subdivisions of the federal government. The outline in Figure 32 is reproduced from this source.
Finally, the special investigations made for congressional committees should be mentioned. These are usually detailed studies of a particular subject and as such are unique sources. They are likewise usually published as Congressional Documents. Excellent examples are the Marine Insurance Investigation of 1920 43 and the Chain Store Investigation of 1929-33. 44

43 S. S. Huebner, Report of Status of Marine Insurance in the United States, and Report on Legislative Obstructions to the Development of Marine Insurance in the United States.

State and Municipal Government. In many cases the most prolific sources for information concerning the individual states are the publications of the federal government that have already been mentioned. In addition there are many publications by the state governments themselves. The latter vary so much from state to state that an attempt to list them would not be feasible. Some states are far in advance of others in furnishing statistical information to their citizens. A few have begun the publication of yearbooks similar in plan to the Statistical Abstract of the United States. A list of sources of market data available for the various states is given in Market Research Sources. 45 This list can be supplemented by consulting the library card catalog under the individual states. Information is likely to be found in the publications of the state departments of Agriculture, Banking and Insurance, Labor, and Highways. The publications of the Land Grant Colleges and other state institutions also contain valuable special data.

Very few source books are published by municipalities, but considerable local information is available in federal and state publications. For example, the Census of Manufactures 46 includes data for individual cities; likewise the Monthly Labor Review 47 gives a retail food price index monthly for 51 cities; and the Industrial Bulletin 48
provides monthly data on employment and payrolls for various cities in New York State.

44 Investigation for the Senate Committee on the Judiciary by the Federal Trade Commission. Published in several parts in 1933 as Numbered Documents of the 72nd Congress.

45 Appendix A, No. 7.

46 Appendix A, No. 12.

47 Appendix A, No. 16.

48 New York State Department of Labor.

Non-Government. State and local information is also available through the publications of certain semi-public organizations. The best examples are the monthly reviews of business issued by the 12 Federal Reserve Banks 49 and reports of statistical studies by the research bureaus of universities.

In addition there are many private agencies which publish statistical data for general use. In a majority of cases the data are collected for the use of an interested group such as the members of a trade association or the subscribers to a service, but are made generally available through magazines and trade papers. There are other cases in which data are collected and published merely to increase the value of a magazine to the reading public. In any event the cost of making the data available must be borne by the subscribers to the publication.

Historically, private agencies preceded the government in supplying current data to the public. Among the pioneers in this field were Dun's Review, Bradstreet's Review, The Commercial and Financial Chronicle, and Babson's Service.

The private agencies compiling and publishing statistical information may be classified as follows: trade, industrial, and financial associations; financial magazines; statistical services; and trade and industrial magazines. The outline of sources used in Appendix A conforms to this one. The examples given there are some of the most commonly used non-government sources. The list has purposely been confined to only a few of the multitude of publications which might have been included.
Many of those omitted contain data of value; hence the need for gradually expanding one's knowledge of them as progress is made in the use of source material.

Foreign and International. While the agencies already named provide most of the data needed for statistical work, there are occasions which call for the use of information from foreign countries or for world data. A list of sources published in foreign countries as well as publications containing world data can be found in The Economists' Handbook: A Manual of Statistical Sources. 50

49 Appendix A, No. 36. These are actually private organizations but because of their close integration with the Federal Reserve System their bulletins have been listed as government publications.

50 Verwey and Renooij, Amsterdam, 1934.

Summary

The use of published sources of business data requires a knowledge of government and non-government publishing agencies and the titles of their publications, as well as a knowledge of the form, frequency, and regularity of publication and the type of data contained. The classification according to publishing agency is the most generally usable one; hence it has been employed in Appendix A.

It is necessary to remember, however, that source material is continually changing with respect to all of these classifications. That is, the type of data may be changed by the addition of new series and the elimination of old ones, or by changing the titles of series and the data included in them. Likewise the form of publication may change. Data formerly scattered throughout the publication may be brought together in a statistical appendix or published separately. On the other hand statistical compendiums may be abandoned. Again the frequency of publication changes when a weekly publication becomes monthly or the reverse; when an annual is supplemented by a monthly and/or weekly or when a weekly or monthly issue begins publication of an annual supplement.
The regularity of publication also undergoes changes. Sources which have appeared regularly for years may be discontinued entirely 51 or subsequently may appear irregularly. Finally, changes occur in publishing agency and in titles of publications. For example, the material formerly found in Bradstreet's Review is now found in Dun's Review; the Bureau of Mines of the Department of the Interior was for several years in the Department of Commerce; the former Commerce Yearbook, Volume II, Foreign Commerce is now the Foreign Commerce Yearbook.

In the light of these changing conditions it may readily be understood that any list such as that given in Appendix A loses its accuracy after a few years. It is therefore necessary for users of published source material to keep abreast of current changes as they occur. The list given in the appendix provides an adequate nucleus which can be kept up-to-date by noting additions and changes from time to time.

51 One of the most useful references, The Annalist, heretofore published weekly by the New York Times Co., was discontinued in October, 1940. Although many of the series of data carried by this publication are no longer available currently, the volumes for earlier years contain valuable data.

APPENDIX A

SELECTED SOURCES LISTED ACCORDING TO PUBLISHING AGENCY, TITLE, FREQUENCY OF PUBLICATION, AND CONTENTS (REVISED TO OCTOBER, 1940)

UNITED STATES GOVERNMENT SOURCES

Department of Commerce
Bureau of Foreign and Domestic Commerce

1. Survey of Current Business (monthly and weekly, with occasional yearly supplements)
The monthly issues present data for the United States concerning business indexes, commodity prices, construction and real estate, domestic and foreign trade, employment, finance, transportation and communication, and statistics of industry in 12 general subdivisions; also some Canadian data.
Brief summary statements and tables are given concerning each, followed by monthly statistics that cover the preceding 13 months. The weekly pamphlet brings some of the monthly series up-to-date in advance of the monthly issue, and gives weekly data for a few important series. The supplements to date have been issued for the years 1931, 1932, 1936, 1938, and 1940. Each covers a number of years, and taken together (except for subsequent revisions) they furnish continuous monthly data, including monthly averages for every year since each series has become available. Full notes are appended explaining the sources and methods of construction of each series.

2. Monthly Summary of Foreign Commerce of the United States (monthly)
Gives dollar value and quantity of all goods exported from and imported into the United States including gold and silver. Includes a detailed classification of both exports and imports by articles and by customs districts. Since each month's issue contains data for that month and the calendar year to date, the December issue contains the total for the year.

3. Domestic Commerce (weekly)
Presents digests of important studies both government and non-government, summaries of federal bills, laws, and court decisions, also statements of recent publications by the various government agencies, a list of recent publications dealing with domestic commerce, and some data regarding changes in wholesale and retail trade along specific lines.

4. Foreign Commerce Yearbook (annually)
Purpose is "to provide in a single volume all the important basic statistical material essential for a comprehension of current economic developments in foreign countries." Part I gives data for each country separately, total trade and trade with the United States being given by specific commodities; Part II gives comparative world statistics on population, production in agriculture and industry, and trade.

5.
Foreign Commerce and Navigation of the United States (annually)
Detailed tables of specific items of export and import by countries; also number and tonnage and ports of arrival and clearance of American and foreign vessels.

6. Consumer Market Data Handbook, 1939 (intervals of 3 or 4 years)
A nation-wide survey of the markets for consumer goods presenting "on a comparable basis for all counties, and in most cases for all urban communities, just how much consumers spent in retail stores, and in service, amusement and hotel establishments; what wholesale business amounted to; how many of the consumers' homes had telephones and electric meters; how many persons made out an income tax return; how many passenger automobiles were registered; what the relief load amounted to; and other factors indicative of the relative importance of each market." An Industrial Market Data Handbook was also published in 1939, giving data regarding the industries in every county in the United States.

7. Market Research Sources (biennially)
Complete and thoroughly cross-referenced lists of all government and non-government sources relating to problems of domestic marketing.

Department of Commerce
Bureau of Census

8. Statistical Abstract of the United States (annually)
Prior to 1938 published by Bureau of Foreign and Domestic Commerce. "A digest of data collected by all statistical agencies of the national government," as well as by some states and private agencies. It consists entirely of summary tables and time series, chiefly annual data, with notes defining the scope and terms used in each, and referring to the original sources.

9. Abstract of the Census (decennially)
A selection of the most essential statistics collected on all subjects at each census, in one volume. Data are given by subjects, states, and cities, and some by counties and smaller civil subdivisions. Some data are included covering outlying territories and possessions of the United States.

10.
Census of Population (decennially)
Data are classified by states, counties, cities or villages, and minor civil subdivisions. The subjects included are: color, race, nativity, parentage, origin of foreign born, sex, marital condition, age, urban and rural distribution, citizenship, school attendance, and literacy. Separate volumes give data on occupations and families, and sometimes unemployment. Reports on special groups or subjects are published at irregular intervals in intercensal years, such as Religious Bodies, Benevolent Institutions, The Blind and the Deaf, Negroes in the United States, etc. A series of Mortality Statistics is published annually.

11. Census of Agriculture (quinquennially)
The year which coincides with a decennial census affords somewhat more detailed information than the intervening non-census year. In both, data are given by states and counties, for the number of farms, color and tenure of farm operator, uses of farm land, value of land and buildings, acreage, production and value of specified crops, and value of livestock by principal classes and age groups. In the decennial years, classifications are made also by minor civil subdivisions, and special reports such as irrigation are included.

12. Census of Manufactures (biennially)
The reports for years which coincide with a decennial census are given in greater detail than those taken in the intervening non-census years. Mines and quarries have been covered only in the decennial years (see 1935 report under Census of Distribution). All reports give data concerning number of establishments; number of wage earners and salaried employees; amount of wages and salaries paid; cost of materials, fuel, and power; value of products; and value added by manufacture. Classifications are by industry groups, by states, by cities, and in some years, notably 1929, by industrial areas.

13.
Census of Distribution, and Business Censuses

(1) Census of Distribution, 1929
This first attempt to gather nation-wide business statistics was a part of the fifteenth census. It included retail trade, wholesale trade, distribution of manufacturers' sales, contract construction, and hotels.

(2) Census of American Business, 1933
This covered the same field as the 1929 Census of Distribution, with the addition of services and places of amusement. Both afford data on number of establishments, net sales or receipts, personnel, and payroll. They are classified by field and kind of business, and by states, with some basic data for cities and counties.

(3) Census of Business, 1935
This is more comprehensive than either of the preceding censuses. Subjects added are: transportation and warehousing, tourist camps, radio broadcasting and advertising agencies, banking and finance, insurance, mines and quarries.

(4) Census of Business, 1939 (part of sixteenth census)
Will be practically the same as 1935.

14. Financial Statistics of Cities (annually)
Shows the financial transactions of cities having a population of over 100,000, including taxes, indebtedness, specified assets, government costs and receipts.

15. Financial Statistics of State and Local Governments (published decennially several years subsequent to census)
"Statistics relating to revenue receipts, governmental cost payments, public debt, and assessed valuations and tax levies, for all divisions of government." The classifications are by states, counties, cities, towns, villages, boroughs, school districts, townships, and other civil divisions. Financial Statistics of States is also published annually when funds permit, but there was none between 1931 and 1937.

Department of Labor
Bureau of Labor Statistics

16. Monthly Labor Review (monthly)
Contains brief reports and complete statistical data on all matters handled by the department.
Each issue contains sections on industrial disputes, wages and hours of labor, employment and payrolls, wholesale and retail prices, and cost of living, and usually other sections on labor legislation, industrial accidents, etc. There are always a few special articles on timely subjects concerning labor, and a list of recent publications by the department.

17. Wholesale Prices (monthly)
Monthly index numbers of wholesale commodity prices by groups and subgroups. Each issue gives the group and subgroup indexes and detailed indexes for certain groups. Comparisons are given with the same month in previous years, and with foreign countries. The December issue gives indexes for the 12 months of the year for the entire series of more than 800 commodities.

18. Retail Prices (monthly)
Index numbers of retail prices of food, coal, electricity, gas, and other consumers' goods. Food data are given every month; coal, electricity and gas at frequent intervals; and other commodities less frequently.

19. Changes in the Cost of Living (quarterly)
Index numbers of changes in the cost of living, divided into 6 groups: food; clothing; rent; fuel, electricity, and ice; house furnishings; and miscellaneous, for 33 cities.

20. Employment and Payrolls (monthly)
Index numbers of employment, payrolls, hours worked, and weekly earnings for all manufacturing and non-manufacturing industries. It includes data for employment and payrolls in the regular executive departments of the federal government and on emergency work.

21. Labor Information Bulletin (monthly)
Brief summary of labor conditions. Hours of work, wages, cost of living, employment and payrolls, wholesale prices, retail food prices, industrial production and trade, agriculture, and government employment and relief for the month.

22.
Numbered Bulletins (irregular, several each year)
"Each bulletin contains matter devoted to one of a series of general subjects, those subjects being Wholesale Prices, Retail Prices and Cost of Living, Wages and Hours of Labor, Employment and Unemployment," and many other subjects of interest to labor, but not of a statistical nature. Bulletin No. 661 gives a selected list of these bulletins as of 1938, and the most recent ones are listed on the back cover of each Monthly Labor Review. In recent years the tendency appears to be not to issue these bulletins on subjects covered by the regular monthly pamphlets listed above, so that the majority of the current bulletins deal with Wages and Hours of Labor in specific industries.

One of the most important of the series is No. 357, Cost of Living in the United States, published in 1924 and giving a complete statistical report of the first extensive cost-of-living study, made in 1918-19. The more recent study made in co-operation with the Works Progress Administration is being reported in a series of bulletins beginning in 1936 under the general title "Studies of Consumer Purchases."

On June 30, 1939 a special pamphlet (unnumbered) entitled Publications of the Department of Labor was issued. This contains a complete list of all the publications of the various bureaus of the Department of Labor since it was organized. Of particular value are the detailed descriptions of changes of form and content which were made during the entire period in the various series published by the department.

Department of Agriculture

23. Agricultural Statistics (annually)
Prior to 1936 was included in the Yearbook of Agriculture. Presents summary tables of all statistical data appearing in periodicals of the department, in great detail, usually covering a series of years.
These include data on all United States crops, livestock, poultry and dairy products; farm business; foreign trade in agricultural products; and some data on world production.

24. Crops and Markets (monthly)
Each issue gives complete detailed reports, estimates, and forecasts on United States crops and other farm products, the items included varying with the seasons. Sectional data are given, and comparison with preceding years. Prices, wages, labor supply, stockyards, and market reports, and some items of export and import are included.

Department of the Interior
Bureau of Mines

25. Minerals Yearbook (annually)
Gives a general survey of mineral production in the United States and the world, and separate chapters dealing with each metal and non-metal. It contains the most recent compilation of statistical data on the more important minerals: coal, natural gas, petroleum, stone, gold, silver, copper, lead, and zinc, etc.

26. Several weekly and monthly bulletins on production and distribution of anthracite coal, bituminous coal, and coke.

War Department

27. Report of the Chief of Engineers, U. S. Army, Part 2, Commercial Statistics (annually)
A complete review of water-borne commerce of the United States both domestic and foreign, freight and passenger, subdivided by grand divisions, by ports, and by commodities. All data are annual for calendar years.

Post Office Department

28. Annual Report of the Postmaster General (annually)
Contains detailed statistical analysis of all receipts and expenditures of the department, for the fiscal year ending June 30; the number of post offices and employees; mail carried by each type of carrier; and money order transactions.

Treasury Department

29. Annual Report of the Secretary of the Treasury on the State of the Finances (annually)
Statistical data for the fiscal year ending June 30 on receipts, expenditures, deficit, public debt, and monetary developments.
The "exhibits" on public debt contain statements of all outstanding obligations (bonds, treasury notes, treasury bills, treasury savings certificates, and currency) issued by the United States government. Also list of securities owned by United States government, and statement of assets and liabilities of government corporations and credit agencies of the United States.

30. Annual Report of Comptroller of Currency (annually)
Report for the fiscal year ending October 31 covering in great detail all banking operations in the United States and money in circulation. Statements are included of Reconstruction Finance Corporation, Farm Credit Administration, Federal Home Loan Bank System, Federal Deposit Insurance Corporation, Pacific National Agricultural Credit Corporation, and United States Postal Savings System.

31. Combined Statement of Receipts and Expenditures, Balances, etc., of the United States (annually)
Very detailed statement of receipts and expenditures of each department and independent office.

Treasury Department
Bureau of Internal Revenue

32. Statistics of Income (published annually, about 2 years late)
Detailed data on income tax returns by individuals, partnerships, and corporations, estate tax returns, and gift tax returns for the United States and individual states. Data for counties, cities, and towns are available in separate mimeographed bulletins. Beginning with 1934 corporation tax returns are published separately as Part II.

33. Annual Report of Commissioner of Internal Revenue (annually)
Detailed statistical report for the fiscal year ending June 30 on all tax revenue of the United States, including income taxes and all other miscellaneous taxes. A few of the tables give monthly data and comparison with previous years.

Board of Governors of Federal Reserve System

34.
Federal Reserve Bulletin (monthly)
The only official statement by the Board concerning the operations of Federal Reserve banks and member banks. Monthly or weekly data are given concerning financial, industrial, and commercial statistics in the United States; international financial statistics; and several indexes constructed by the Division of Research and Statistics of the Federal Reserve Board on industrial production, construction, employment and payrolls, freight car loadings, and department-store sales. Summaries and discussion of current financial events, legislation, etc., appear in each issue.

35. Annual Report of the Board of Governors of the Federal Reserve System (annually)
Data similar to those in the monthly issues, but given for a series of years, some dating back to 1914.

36. Monthly Publications of Federal Reserve Districts (monthly)
The Federal Reserve Bank of each of the 12 districts publishes a monthly bulletin summarizing business conditions in that district.

Federal Home Loan Bank Board

37. Federal Home Loan Bank Review (monthly)
Contains data concerning housing and building conditions including building permits, building costs, mortgages, building and loan association activity, government housing activity. It includes the monthly operating statement of the Home Owners Loan Corporation and the financial statement of the Home Loan Banks. Important special articles on housing appear in each issue.

38. Annual Report of Home Loan Bank Board (annually)
Contains data for recent years similar to those in monthly issues.

Federal Power Commission

39. National Electric Rate Book and State Rate Books (periodic intervals)

40. Monthly Bulletin (monthly, and an annual summary)
Production of electric energy in the United States, sources of energy by states, average daily production by public utility plants.

Federal Communications Commission

41.
Operating Data from Monthly Reports of (a) Telephone; (b) Telegraph Carriers (monthly)
a) A report of detailed operating revenues, operating expenses, income items and changes in capital items, by regions for telephone carriers giving the current month and cumulative totals for the year to date.
b) A report of detailed revenue, expenses, and income of individual telegraph carriers giving the current month and cumulative totals for the year to date.

Interstate Commerce Commission
Bureau of Statistics

42. Statistics of Railways in the United States (annually)
Summary statements concerning equipment, employees, revenues, expenses, and other data for all steam railways in the United States usually classified by districts. There are also separate reports from each company.

43. Annual Report of the Interstate Commerce Commission (annually)
Contains a statistical appendix giving data on railway development for a preceding period of years. Also contains miscellaneous summaries of data on operating revenue, expense and income, operating ratios, employment, and car loadings.

44. Freight Commodity Statistics, Class I Steam Railroads in the United States (annually)
Includes annual data on car loadings by districts and groups of commodities for the preceding ten years, also quarterly data by districts and commodity groups for the latest available year. The data are also broken down into individual commodities carried by individual railroads.
A quarterly supplement to this report, having the same title and giving the car loadings for the most recent quarter classified by districts and by individual commodities, is also published.

45. Wage Statistics of Class I Steam Railways in the United States (monthly)
A complete statement by occupations of the number employed, time worked, and wages received, with summaries.

46. Statistics of Class I Motor Carriers (annually)
Data regarding motor transportation of property and passengers.
NON-GOVERNMENT SOURCES

Financial, Trade, and Industrial Associations

47. Reports of National Industrial Conference Board by National Industrial Conference Board, Inc., New York
a) The Economic Record (semi-monthly)
Data on wages, earnings, hours, and employment by individual industries; also cost of living. All data except retail food prices are collected by the Conference Board, and are independent of similar series published by the Bureau of Labor Statistics.
b) The Management Record (monthly)
Data similar to those in Economic Record with articles of interest to employers.
c) Special studies (irregularly) as supplements to The Economic Record.

48. Annual Statistical Report of American Iron and Steel Institute (annually) New York
Data concerning all iron and steel products, classified by types and by states for a period of years. The report includes also data on prices, foreign trade, production in other countries, and some information on allied industries, as coal and coke.

49. Electrical Research Statistics (monthly) by the Edison Electric Institute, New York
A single sheet giving classified data of production, consumption, and sales of electric power. A similar sheet is also issued weekly, and a more comprehensive annual bulletin. This series supersedes similar bulletins published until 1937 by the National Electric Light Association, New York.

50. Automobile Facts and Figures (annually) by Automobile Manufacturers Association, New York
Devoted exclusively to data related to the automobile industry; includes production, sales, registration, taxation, financing, exports, truck transportation, used car sales, and allied data.

51. Statistical Bulletin (annually) by American Petroleum Institute, New York
A complete collection of data relative to the petroleum industry including production, consumption, imports and exports, and stocks on hand for the various products.
The data are given monthly for the last two years and annually back to 1918. Monthly supplements of the Statistical Bulletin give current figures comparable with those in the annual issue. Additional current data dealing mainly with crude oil production by producing areas are supplied weekly.

52. Exchange (monthly) by New York Stock Exchange, New York
Supersedes the New York Stock Exchange Bulletin, giving summary data of the activities of the New York Stock Exchange, number and volume of sales, etc.

53. Monthly Survey of Life Insurance Sales in the United States (monthly) by Life Insurance Sales Research Bureau, Hartford, Conn.
A report of new ordinary insurance written for the current month and for the year to date, subdivided by states and regions. In 1937 a special report was published giving revised sales figures monthly from 1930 to 1936.

Financial Magazines and Papers

54. The Commercial and Financial Chronicle (weekly) by Wm. B. Dana Co., New York
Gives stock and bond quotations on the various exchanges for the preceding week, banking and financial data currently reported, corporation balance sheets and statements, general industrial, trade and commodity data, news, and comments. Difficult to use because of variable content from week to week, but a valuable source for a wide variety of data.

55. Business Week (weekly) by McGraw-Hill Publishing Co., Inc., New York
A page entitled "Figures of the Week" contains data on production, trade, prices, finance, and banking.

56. Barron's (weekly) by Barron's Publishing Co., Inc., New York
Material very similar to the Commercial and Financial Chronicle, but presented in somewhat more popular style. Features several original indexes, barometers, etc.

57. Dun's Review (monthly) by Dun & Bradstreet, Inc., New York
General analysis of business conditions including regional indexes. The original source for data on business failures and indexes of wholesale commodity prices.
Each month Dun's Statistical Review is published as a supplement to the regular magazine. This supplement contains more detailed data on the subjects included in the magazine.

58. Wall Street Journal (daily, except Sundays and holidays) by Dow-Jones & Co., Inc., New York
Current events of economic and financial interest in United States and world. The previous day's quotations on stocks and bonds on exchanges throughout the country and in foreign countries, as well as commodity information are given. Indexes of stock, bond, and commodity prices, dividend payments, and industrial data are included.

Statistical Services

59. Standard Trade and Securities, Statistical Bulletin (annually and monthly) by Standard Statistics Co., Inc., New York
A complete record of monthly data concerning business and financial operations running back as far as the data are available. This is one of the most valuable reference books in print. The series are kept current in monthly supplements which may be bound with the most recent annual volume.

60. Moody's Manuals of Investment (annually) by Moody's Investors Service, New York
Contains financial statements of several thousand corporations both domestic and foreign, including a brief history, balance sheet and income statement of each corporation and a record of securities in the hands of the public. There are five volumes published each year: industrials, railroads, public utilities, banks and finance, government and municipals.

Trade and Industrial Magazines

(The titles of these magazines give sufficient indication of the type of data contained, consequently the descriptions have been omitted)

61. Iron Age (weekly) by Chilton Co., Philadelphia, Pa.
62. Steel (weekly) by Penton Publishing Co., Cleveland, Ohio
63. Metal Statistics (annually) by American Metal Market Co., New York
64. India Rubber World (monthly) by Bill Brothers Publishing Co., New York
65.
Textile World (monthly) by McGraw-Hill Publishing Co., Inc., New York 66. Northwestern Miller (weekly) by The Miller Publishing Co., Minneapolis, Minn. 67. Automotive Industries (weekly) by Chilton Co., Philadelphia, Pa. 68. Railway Age (weekly) by Simmons-Boardman Publishing Co., New York 69. Marine Engineering and Shipping Review (monthly) by Simmons-Board- man Publishing Co., New York 208 BUSINESS STATISTICS 70. Engineering and Mining Journal (monthly) by McGraw-Hill Publishing Co., Inc., New York 71. Printers' Ink (weekly) by Printers' Ink Publishing Co., New York 72. Chain Store Age (monthly) by Chain Store Publishing Co., New York PROBLEMS 1. A young stockbroker interested in general business conditions is planning a small library of statistical source material. The following list has been selected as adequate: World Almanac, current year; subscription to Monthly Summary of Foreign Commerce; subscription to Commercial and Financial Chronicle; subscription to Business Week; Statistical Abstract of the United States, most recent volume; Vol. I of Population, Sixteenth Census. a) Which of the foregoing would you retain? b) Name four others that should be included. c) Give reasons for your choice in (a) and (b). 2. Name the publications that correspond to the following descriptions: a) Published monthly, by a government agency, containing some textual material and about 50 pages of tables that are practically identical in form from month to month, chiefly on the subject of finance. b) A 4-page leaflet published weekly by a government agency, containing certain indexes and other current weekly data in every issue; and also a number of series of monthly data, some of which appear in one issue and some in another, during each month. c) A group of large volumes published annually by a private company, each volume of which contains complete information concerning individual corporations of a certain type. 
d) A monthly government publication dealing solely with exports and imports, the December issue of each year constituting a summary of that year's data.
e) A volume published by a private statistical concern, containing long series of monthly data and index numbers on every phase of business, the series being kept up-to-date by the addition of current supplements.
f) A series of volumes, published at intervals of an irregular number of years and under slightly different titles, by a government agency, each issue containing the most complete data available in the United States on trade and various aspects of business other than industrial production.
g) A weekly periodical, non-government, each issue of which contains current data on steel prices, with much more complete data on production, shipments, etc., in a large special number issued during January of each year.

3. From the following list, describe each publication according to the five methods of classification named in chapter IX (instructor will assign one or more to each student according to the time available): (a) Monthly Labor Review; (b) Minerals Year Book; (c) Survey of Current Business; (d) Abstract of the Census; (e) Census of Business; (f) Statistical Abstract; (g) Monthly Summary of Foreign Commerce; (h) Moody's Manuals of Investments; (i) Commercial and Financial Chronicle.

4. (Class exercise.) Name a source in which you think each of the following sets of data would be available. Explain your choice in each case.
a) The number of tons of pig iron produced in the United States monthly from 1929 to 1936 inclusive.
b) The number of employees on the payrolls of manufacturing concerns in the United States in 1934, 1935, and 1936.
c) The number of gallons of gasoline consumed monthly in the United States during the first six months of last year.
d) The number of automobiles produced in the United States in 1935.
e) The amount of sales by chain grocery stores in the state of New York in 1939.
f) The value of agricultural products exported by the United States during the most recent month.
g) The number of freight car loadings of grain and grain products shipped in the United States during the year before last.
h) The index of department-store stocks for the United States, for the most recent month.

5. Give exact reference to a publication (not mentioned in the text) containing numerical data not in tabular form.

6. List publications (not mentioned in this section of the text) illustrating each subheading under the classification "Frequency of Publication."

7. Give exact reference to a special statistical study (not mentioned in the text).

CHAPTER X
THE USE OF LIBRARY SOURCES

INTRODUCTION

A SUPERFICIAL consideration of the matter might easily lead one to expect that the entire task of collecting data from library sources consists in copying a quickly discovered list of figures from a book readily supplied by a library attendant. This is not what usually happens. Only in highly specialized libraries will an attendant be found who is trained in the intricacies of source material. In most cases the library staff will not be able to render any greater service than that of obtaining books and magazines from the stacks on request.

Efficiency in collecting data from libraries comes only with long practice. It is a case primarily of learning to know what data to expect in different sources. While the beginner has no choice but to use what might be called the "shotgun" method, that is, search until the desired data happen to be found, a seasoned investigator uses a process of elimination based on his previous experience to narrow his search to two or three likely sources. By contrast this might be called the "rifle" method.
If his selection has been accurate very little time will be required to find the data, or to obtain a guide as to where they may be found, or to discover that they are not available.

In passing from the "shotgun" to the "rifle" method, there are two major things to be considered: (1) how to find a good source and (2) how to use it after it has been found.

FINDING A GOOD SOURCE

The purpose of this section is to set up a sequence of steps which can be generally employed in searching for a desired set of data. The process is one of successive elimination, but some guidance in the order of procedure will facilitate the work. There are usually two stages in the search, finding data on the general subject and finding a particular set of data. There is no way of knowing in advance at what point the search should be concentrated on specific information. That must be determined in individual cases according to the circumstances.

Steps

The following steps are suggested in making a search of library sources.

Step 1. There are several standard reference sources which should be consulted for information on the desired subject. These are: Statistical Abstract of the United States,1 Agricultural Statistics,2 Survey of Current Business,3 Monthly Labor Review,4 Federal Reserve Bulletin,5 Standard Trade and Securities Statistical Bulletin.6 Look in the indexes of these publications for the subject of the search. If the particular data can be obtained from one or several of them the search is ended.

Step 2. If the desired data cannot be found in these sources, study the titles, headnotes, footnotes, and references of tables on the general subject to discover the original sources which may contain more detail. Study these detailed sources in turn for references to collateral sources.

Step 3. If steps 1 and 2 have not led directly to the publication containing the precise information required, it is time to consult a bibliography of source material.
The current edition of Market Research Sources7 provides the most useful guide for any subject related to domestic marketing. It contains a full "finding guide" of subjects followed by a list of government and non-government publications classified according to publishing agency. Books and yearbooks are included as well as periodicals.

If current data are desired, it is quite likely that their origin can be traced through the use of another publication of the United States Department of Commerce entitled Sources of Current Trade Statistics.8 This book is arranged in ready reference form so that the source of a particular series of data can be found through a finding index in the first part of the book and a list of sources in the second part. Neither the finding index nor the list of references includes any annual publications or statistical compendiums. For example, the Statistical Abstract and the Standard Trade and Securities Statistical Bulletin are not mentioned. This guide does, however, include some references on foreign trade which is one of the few subjects not covered by Market Research Sources.

1 Appendix A, No. 8.
2 Appendix A, No. 23.
3 Appendix A, No. 1.
4 Appendix A, No. 16.
5 Appendix A, No. 34.
6 Appendix A, No. 59.
7 Appendix A, No. 7.
8 Latest edition to date, June, 1937.

Data expressed in index number form can often be located by referring to An Index to Business Indices.9 This book contains a finding index that is convenient to use in locating the detailed descriptions of indexes appearing in the second part of the book. These descriptions include the names of the source or sources in which the desired index can be obtained.

Step 4. At this point the card catalogue of the library should be consulted if the data have not been found. The cards are classified by author, title, and subject. Look up the subject concerning which you want to get the data. You will probably find references to non-government publications.
Select those which are likely to contain data and investigate them. If the data are still elusive the next reference should be to the government list of publications. These are sometimes not included in the main subject catalogue of the library but are listed separately under "United States." The classification is by departments, bureaus, commissions, and offices. The publications most likely to yield results are listed in Appendix A.

Step 5. Each month the Government Printing Office issues the Monthly Catalogue of United States Public Documents which includes all publications during that month. Several monthly catalogues should be examined to discover any recent publications on the subject of the search. This "check list" is classified by departments, bureaus, etc.

Step 6. If access to the stacks of the library is possible, the search should be continued there. Go to the section in which you have already found books dealing with the subject and there perhaps other publications will be found which contain the desired data.

Step 7. Look through trade, financial, and technical magazines. The ones most likely to be productive will be determined by the nature of the subject. Some of these are listed in Appendix A.

Step 8. If the data are still elusive or perhaps incomplete go through the periodical indexes which are found in the library. The following are most frequently available: Readers' Guide to Periodical Literature,10 Industrial Arts Index,11 Public Affairs Information Service,12 New York Times Index.13

9 Donald H. Davenport and Frances V. Scott, An Index to Business Indices, Chicago: Richard D. Irwin, Inc., 1937.
10 H. W. Wilson Co., New York.
11 H. W. Wilson Co., New York.
12 Public Affairs Information Service, New York.
13 New York Times Co., New York.

Step 9. If at this point the desired data have not been found, it is time to consult some experienced person who may have knowledge of them.
The experienced person for students means the teacher; for research workers, a fellow-worker or director. Finally, it may be desirable to write to a government or non-government agency for the information. The United States Information Service, 1405 G Street N.W., Washington, D. C., has been established as a Division of the Office of Government Reports to answer inquiries regarding the departments and agencies of the federal government.

Only in the most difficult cases will it be necessary to employ all of these steps. Usually the first two or three will be productive. After a few searches have been made the general contents of the major publications will be sufficiently familiar so that in most cases the proper source can be selected immediately. The further one progresses in the use of library sources the less the need for formal methods and the greater the reliance on experience.

Examples of Library Search

Some problems for library search were assigned to a student who had a slight acquaintance with the titles of the various statistical publications but very little knowledge of their contents. His report of the results of the search is reproduced as Appendix B at the end of this chapter. The report portrays the student's reaction to success or failure during the search with a sincerity which could not have been simulated by the authors if they had attempted to write this appendix. The examples were arranged so that successive ones would require the use of additional steps of the search process. Careful study of the student's explanations will show that in doing the first few examples he acquired considerable knowledge of the contents of standard sources which saved time in the later examples. This experience and the similar experiences of many other students lead to the conclusion that the only way to acquire familiarity with the contents of published sources is by handling them and searching through them for some definite piece of information.
THE CORRECT USE OF DATA

The search procedure of the preceding section leads to the location of a given set of data in a single source or in two or three alternative sources. Before the data can be transcribed they must be put through a process of verification and tested for validity.

Verification of Data

The data should be verified (1) to detect discrepancies, (2) by cross-reference when multiple sources are available.

Discrepancies. Discrepancies in data are usually not difficult to detect but may escape the unwary collector. They may appear as a result of one or more of the following causes.

Changes in unit: Some of the things that may be expected are changes in the unit of measure, changes in the definition of the unit and changes in the nature of the unit. Illustrations of all of these changes can be found in the Statistical Abstract for 1936. An example of change in the unit of measure is shown in Table 524 which presents "Imports of Merchandise by Commodity Groups and Articles." On page 536 the first item is wood pulp. The unit used is long tons prior to 1935 and short tons beginning with 1935. A change in the definition of the unit appears in Table 247 entitled "Reporting Member Banks in 101 Leading Cities: Principal Assets and Liabilities." "Demand Deposits Adjusted" is the heading of the next to the last column. Through August, 1934, the data are net demand deposits, but subsequently are adjusted as explained in the footnote. The figures really represent two different things and cannot be regarded as a single series even though they are printed in the same column. Table 426, "Railway Equipment in Service, All Reporting Companies," shows that there was a larger number of steam locomotives in service in 1916 than in 1929 despite the greater volume of traffic hauled during the later year.
This is explained by the change in the nature of the thing counted, since a locomotive manufactured in 1916 was hardly the same as a locomotive manufactured in 1929.

Changes in classification: Arrangements according to time, space, or attribute may be involved. A change of the time period for recording railroad data occurred in 1916 when a shift was made from fiscal to calendar years. An adjustment must be made for this change if the earlier and later periods are to be combined in a single series. Changes in the boundaries of the wards of cities have the effect of changing the classification of any data reported by wards. Changes in attribute classifications appear frequently in the biennial Census of Manufactures, as illustrated by the following statement introductory to the section entitled "Radio Apparatus and Phonographs."

At censuses taken prior to 1931, the manufacture of phonographs was treated as a separate industry, but the increasing production of radio apparatus by manufacturers of phonographs and the introduction of the combination radio-phonograph unit made it desirable to establish the present classification. Manufacturers of radio apparatus were formerly classified in the "Electrical machinery, apparatus, and supplies" industry. The schedule for this industry did not call for detailed data on the production of radio apparatus, and therefore no comparative statistics are given for years prior to 1931.14

A discrepancy which is closely allied to a change in spatial classification occurs when the area for which data are reported is changed. Such changes may arise from shifts in national boundaries, in customs districts, or in navigable waters. On the other hand the changes may be of a purely statistical character, as the following examples will show. The Bureau of Labor Statistics report of building permits issued included 262 cities in 1921 and 1922.
In subsequent years the number of cities has been gradually increased until in May, 1940, it reached 2,047. The figures are clearly not comparable from 1921 to 1940. Comparable figures over a period of years for 257 identical cities are published in each issue of the Statistical Abstract.

The birth and death registration area of the United States is another example of changing area. Starting with Massachusetts, New Jersey, and the District of Columbia in 1880, the registration area for deaths has been gradually expanded until in 1933 for the first time all of the states were included. The birth registration area started with ten states and the District of Columbia in 1915 and expanded gradually until all of the states were included in 1933. During this period the number of births and deaths cannot be compared from year to year, but birth rates and death rates are approximately comparable.

Revisions: Perhaps the best example of this type of discrepancy is to be found in Agricultural Statistics (formerly included in the Yearbook of Agriculture). There are scarcely two yearbooks which give the same series of figures for the country's wheat production. In the issue of 1935 corrections were made as far back as 1866. While there are many other cases of this kind in recorded data, it is unlikely that many can be found which are less stable than the records of crop estimates of the Department of Agriculture. Presumably the only thing which can be done with such figures is to use the most recent issue and hope that corrections made in subsequent issues will not destroy the validity of the data used.

14 Census of Manufactures, 1933, p. 577.

It cannot be safely assumed, however, that the most recent or revised figure is always correct. Errors in revisions occur less frequently than in preliminary figures, but are more likely to be overlooked. An example appeared in the Survey of Current Business during the early months of 1937.
The particular series involved was "Total Car Loadings." Table 24 is a reproduction of the data with footnotes intended to explain the changes as printed in three successive issues.

TABLE 24
TOTAL CAR LOADINGS AS PRINTED IN THE MONTHLY SURVEY OF CURRENT BUSINESS WITH THE PERTINENT FOOTNOTES
(000 omitted)

AS PRINTED IN                  1936
THE ISSUE OF         JANUARY   FEBRUARY   MARCH
February, 1937*      2,353     3,135      2,419
March, 1937†         2,975‡    3,135      2,419
April, 1937                    2,512‡     2,419

* Data for February, 1936, are for 5 weeks, other months, 4 weeks.
† Data for January, 1936, are for 5 weeks, other months, 4 weeks.
‡ Revised.

The footnotes do not explain all that happened to this series. In the February issue of 1937, and in the ten preceding issues, car loadings for five weeks were included in the February, 1936, figure, giving 3,135,000 cars. Beginning with the March, 1937, issue the loadings for a week which ended February 1, 1936, were shifted from the February total to the January total. Thus the January, 1936, total was increased to 2,975,000 cars, but an equivalent deduction was not made from the February total. As a result 622,000 cars reported for the week ending February 1, 1936, were counted twice in the March, 1937, issue. The error in the February, 1936, total was corrected in the issue of April, 1937, but unfortunately the March issue is most frequently used because it contains data for the full 12 months of 1936.

Typographical errors: A good example is found in the record of bank clearings printed weekly in the Commercial and Financial Chronicle. Individual clearings are printed for more than 100 cities and in that list it is not uncommon to find as many as five changes in the data copied from the previous week. There is no way of knowing which is the misprint since no explanations are included. Such errors are most likely to occur in publications which are not carefully proofread.
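The double counting in the car-loadings series of Table 24 can be confirmed by simple arithmetic. The following is a minimal sketch in Python, not part of the original text; the variable names are illustrative assumptions, and the figures are the January and February, 1936, values from Table 24 in thousands of cars. Shifting one week from February to January should leave the two-month total unchanged, so any increase in that total measures the cars counted twice.

```python
# January and February 1936 car loadings (thousands of cars) as printed in
# successive 1937 issues of the Survey of Current Business (Table 24).
feb_issue = {"January": 2353, "February": 3135}
mar_issue = {"January": 2975, "February": 3135}
apr_issue_february = 2512  # revised February figure from the April issue

# Week moved into January by the March revision.
shifted_week = mar_issue["January"] - feb_issue["January"]

# A transfer between months should not change the two-month total.
total_feb_issue = sum(feb_issue.values())
total_mar_issue = sum(mar_issue.values())
double_counted = total_mar_issue - total_feb_issue

# Two-month total once the April correction to February is applied.
corrected_total = mar_issue["January"] + apr_issue_february

print("week shifted to January:", shifted_week)
print("two-month total, February issue:", total_feb_issue)
print("two-month total, March issue:", total_mar_issue)
print("counted twice in March issue:", double_counted)
print("two-month total after April correction:", corrected_total)
```

Because the published figures are rounded to thousands, the corrected two-month total agrees with the original total only to within one thousand cars. The same test, that a revision which merely reallocates items should preserve the combined total, can be applied to any revised series.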
Interruptions in series: Loss of continuity in a series which has been published regularly creates a problem for the user. If the interruption is brief such as the gap in the recording of bank debits caused by the "bank holiday" in March, 1933, simple interpolation may be all that is needed to resolve the difficulty. There are other cases of failure to publish which are less easy to overcome. For example, from July, 1933, to February, 1935, inclusive the Post Office Department found it inconvenient to release for current publication the figures for postal receipts in "Fifty Selected Cities" and in "Fifty Industrial Cities." Such a prolonged suspension brings to a halt any statistical work involving use of the missing data.

Even a slight experience will afford enough background to insure that many of the inconsistencies in published data will be recognized. Beyond that lies the task of detecting the less obvious discrepancies. Two things are necessary for this: the first is varied experience in collection, the second is the exercise of common sense. The latter might be defined as a combination of experience, judgment, and figure perception.

Cross-Reference. In many cases only one source can be found for a required set of data and no verification by cross-reference is possible. Frequently, however, similar data are collected by several agencies. In these cases all of the sources should be found as a means of determining which is most complete, which contains the data in most usable form, and which has the best general record of reliability. It is also desirable to get the most recently published source so that any corrections or revisions of the data will be discovered. If the record coincides in all of the sources, that fact gives added confidence in the accuracy of the data. If differences appear, the necessity of reconciling them arises.
Discrepancies of the types enumerated in the preceding section may be involved or fundamental differences in the method of collection may be uncovered by study of the notes accompanying the tables. If inconsistencies arise which cannot be explained, it is necessary to search for collateral sources or perhaps to write to the collecting agency for further information.

The process of comparing the data in several sources is known as cross-reference. An example of the use of cross-reference will serve to demonstrate the method and its advantages. Suppose that the following problem were proposed on June 1, 1937: "Collect data on annual production of steel ingots for the years 1932-1936, inclusive." The information obtainable from four sources is shown in Table 25, columns 1 to 4.

TABLE 25
PRODUCTION OF STEEL INGOTS IN THE UNITED STATES, ANNUALLY 1932-36, AS REPORTED IN FOUR SOURCES
(thousands of long tons)

YEAR   STATISTICAL   ANNALIST   STEEL¶    STEEL YEARBOOK   REVISED SERIES, BESSEMER AND
       ABSTRACT*                          OF INDUSTRY**    OPEN-HEARTH PRODUCTION
       (1)           (2)        (3)       (4)              (5)
1932   13,681        13,323†    13,464    13,323           13,323
1933   23,232        22,594†    22,894    22,594           22,594
1934   26,055        25,599‡    25,949    25,599           25,599
1935   34,093        33,426§    33,940    33,418           33,418
1936                 46,919‖    47,513                     46,808

* 1936 issue, p. 705.
† December 7, 1934, p. 790.
‡ February 14, 1936, p. 270.
§ April 10, 1936, p. 549.
‖ February 12, 1937, p. 277.
¶ May 10, 1937, p. 32, second table, "Annual Steel Ingot Production."
** January, 1937, p. 360, "Steel Ingot Production, 1917-1937."

The four reports contain different figures despite the fact that the original source of all four sets of data is the American Iron and Steel Institute. The title of the table from the Statistical Abstract states that steel ingots and steel for castings are included. Since the problem asks for steel ingots only, these data would not be satisfactory, even though the figure for 1936 could be supplied from current sources. An examination of the March, 1937, Survey of Current Business in which the tonnage for steel ingot production and castings is given separately indicates that none of the other three series, Table 25, columns 2 to 4, includes castings. Further study is needed, however, to reconcile the differences in these three series.

The Annalist data correspond to the Steel Yearbook through 1934, but differ in 1935. If the Annalist for October 9, 1936, instead of April 10 is used, revised monthly data are found for all of 1935 which agree with those in the Yearbook, leading to the conclusion that the 1936 Annalist figures will likewise be revised later in 1937. It can now be concluded that these two series coincide, with only the revised 1936 figure lacking. Headings and footnotes to the respective tables indicate that both include only Bessemer and open-hearth processes.

The data from Steel, column 3, are classified in the original source according to processes including crucible and electric as well as Bessemer and open-hearth. The difference between this series and the other two is explained by the inclusion of production by the crucible and electric processes. If a subtotal for Bessemer and open-hearth processes only is computed from the original table, the results coincide through 1935 with those from the Annalist and the Steel Yearbook. Since the May issue of Steel was published later than either the Yearbook or the Annalist, one can assume its 1936 figure is the more recent revision.

The series for steel ingot production by open-hearth and Bessemer processes can therefore be completed as shown in column 5, and there is now no disagreement among the three sources. There are, however, two complete series to choose from: column 3, which includes crucible and electric production, and column 5, which does not.
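The first stage of this cross-reference, comparing two sources year by year and listing where they disagree or where one is silent, can be sketched as follows. This is an illustrative Python sketch only; the function name `cross_reference` is hypothetical, and the figures are the Annalist and Steel Yearbook values from Table 25, columns 2 and 4, in thousands of long tons.

```python
# Annual steel ingot production, thousands of long tons (Table 25).
annalist = {1932: 13323, 1933: 22594, 1934: 25599, 1935: 33426, 1936: 46919}
steel_yearbook = {1932: 13323, 1933: 22594, 1934: 25599, 1935: 33418}


def cross_reference(a, b):
    """Return {year: (a_value, b_value)} for years present in both series
    whose reported values differ."""
    shared_years = sorted(a.keys() & b.keys())
    return {year: (a[year], b[year]) for year in shared_years if a[year] != b[year]}


# Years where both sources report a figure but the figures disagree.
differences = cross_reference(annalist, steel_yearbook)

# Years reported by only one of the two sources.
unmatched_years = sorted(annalist.keys() ^ steel_yearbook.keys())

print("disagreements:", differences)
print("years in only one source:", unmatched_years)
```

A listing of this kind only locates the differences; explaining them still requires reading the headings, footnotes, and later revisions of each source.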
Since the reports issued by the American Iron and Steel Institute usually include open-hearth and Bessemer only, column 5 appears to be the most desirable series to use.

There are two major advantages in conducting this search: (1) the determination of the best figures to use for steel ingot production and (2) the collateral knowledge acquired concerning methods of recording data on steel ingot production.

Evaluation of Data

Evaluation deals not so much with the accuracy of data as with their validity. The question is: Are these data satisfactory for the purpose for which they are to be used? The answer can be obtained by understanding the background of the collection and by visualizing the collection process.

Understanding the Background. Data come to exist either as a by-product of non-statistical activity or directly for statistical purposes.

There are many examples of series of data which are collected for statistical purposes. The work of the Bureau of Census, the Bureau of Labor Statistics, and the Bureau of Agricultural Economics is carried on for the purpose of providing numerical information for general use. The purpose is directly statistical.

Illustrations of data secured as a by-product of other activity are gasoline consumption by motor vehicles and cigarette consumption, both obtained by the Bureau of Internal Revenue in the course of collecting the taxes levied on these articles by the government. Further examples are a census of employment, which might be tabulated from the registration cards for retirement annuities filed with the Social Security Board by employed workers at the end of 1936, and an index of grocery prices which might be computed from the newspaper advertising of grocery stores.

By-product data are collected for some official or business purpose. Once they have served that purpose the collectors have no further interest or at most only a collateral interest in them.
They may be kept in poor form; errors corrected for the major purpose may be omitted from the statistical record; there may be overlaps and omissions which creep in because the statistical record has not been adequately checked; the data may not be in usable form for statistical purposes, although serving the major purpose well. Since the data from by-product sources are likely to contain inaccuracies, it is desirable wherever possible to cross-check them in a direct statistical source.

Visualizing the Collection. This means asking the question: How were the data collected? By answering this question considerable insight will be gained concerning the difficulties that were encountered in collecting the data and consequently a fair basis may be obtained for judging their reliability.

An example will show what is involved in visualizing the collection. The United States Department of Agriculture publishes estimates of wheat production annually. To collect complete information concerning the amount of wheat produced would involve canvassing each year more than half of nearly 7,000,000 farmers in the United States. This would be a long and costly task and even if it were possible to do the work the results would contain some error because many farmers have no accurate record of the size of their wheat crop. Hence the Department of Agriculture makes no attempt to collect complete data annually.

There are crop reporters in all parts of the country who voluntarily send in estimates of the number of acres planted in wheat in the sections their reports cover. Only a small part of the wheat acreage in the country is thus reported, but by applying the estimates to unreported areas statisticians are able to calculate the acreage planted in wheat for the entire country. Then at harvest time the same crop reporters send in estimates of the average yield per acre in their territories.
By multiplying the estimated acreage by the estimated yield per acre the approximate production can be obtained for each section of the country. The total of these sectional estimates is the only annual production figure available for the whole United States.

Every five years (ten years, prior to 1920) an actual census of production is taken and the estimates are checked against the census. Table 26 shows that the estimates varied from the census by more than 3 per cent on only two occasions and in four cases have varied by less than 1 per cent. Hence the conclusion is that the Department of Agriculture annual estimates of production are fairly accurate, but the margin of error inherent in the method of collection must be kept in mind when they are used.

TABLE 26
COMPARISON OF DEPARTMENT OF AGRICULTURE ESTIMATES OF WHEAT PRODUCTION WITH BUREAU OF CENSUS COLLECTION*

YEAR   DEPARTMENT OF AGRICULTURE   BUREAU OF CENSUS   PER CENT VARIATION
       ESTIMATE (IN BUSHELS)       COLLECTION         (1) ÷ (2) - 100%
                                   (IN BUSHELS)
       (1)                         (2)                (3)
1879   459,234,000                 459,483,000        - .05
1889   504,370,000                 468,374,000        +7.69
1899   655,143,000                 658,534,000        - .51
1909   683,927,000                 683,379,000        + .08
1919   952,097,000                 945,403,000        + .71
1924   841,617,000                 800,877,000        +5.09
1929   823,217,000                 800,649,000        +2.82
1934   526,393,000                 513,213,000        +2.57

* Agricultural Statistics (1937), pp. 9-10.

Example of Evaluation. Table 27 illustrates many of the pitfalls in the use of data and shows the method of evaluating data from the notes which accompany the table.

One quickly detects from reading the several notes that the information contained in this table has variable accuracy. For some states the sales are determined by the number of tags addressed to consumers in that state by fertilizer manufacturers.
If the counts are kept accurately, if the bags are all the same size, and if car-load shipments sent to retailers near state boundaries are distributed mainly in the state in which the retailer resides, then the tag count may give fairly good results. For other states estimates are made either by state authorities or by the National Fertilizer Association. Actual records of sales are compiled by state authorities for another group of states. For the year 1929 data collected by the Census of Agriculture in 1930 are used as the most reliable estimates of sales in some states but not in others.

TABLE 27
FERTILIZER: ESTIMATED SALES IN THE UNITED STATES

NOTE. Data are based on fertilizer tag sales for some States and are compiled by State authorities from sales records, etc., for others, as indicated by footnotes. For 1929, census data have been used in many cases. Other figures are estimates made by State authorities or the office of the National Fertilizer Association.

(In tons of 2,000 pounds)

| DIVISION AND STATE | 1928 | 1929 | 1935 (prel.) |
|--------------------|------|------|--------------|
| United States | 7,985,019 | 8,078,548 | 6,191,321 |
| New England | 365,119 | 357,465 | 282,503 |
| Maine | 178,750 | 185,650 | 125,000 |
| New Hampshire* | 16,900 | 11,500† | 16,000 |
| Vermont^ | 16,911 | 14,905 | 15,295 |
| Massachusetts*$ | 70,458 | 68,611† | 63,208 |
| Rhode Island | 10,100 | 7,909† | 12,000 |
| Connecticut | 72,000 | 68,890† | 51,000 |
| Middle Atlantic | 743,558 | 798,433 | 658,874 |
| New YorkJ | 260,000 | 287,959† | 234,000 |
| New Jersey‖ | 143,574 | 162,361† | 149,408 |
| Pennsylvania^ | 339,984 | 348,113† | 275,466 |
| East North Central | 755,711 | 820,402 | 658,696 |
| Ohio* | 320,866 | 338,662 | 306,509 |
| Indiana^ | 221,082 | 250,201 | 194,946 |
| Illinois!^ | 30,509 | 38,056 | 23,827 |
| Michigan | 150,213 | 152,812 | 105,000 |
| Wisconsin^ | 33,041 | 40,671 | 28,414 |
| etc. | | | |

* Year ended June 30, except data for 1929.
† Agricultural census.
‡ Compiled by state authorities, except as noted.
§ Year ended March 31, except data for 1929.
‖ Year ended October 31.
¶ Based on tag sales.
Source: The National Fertilizer Association, Statistical Abstract (1936), p.
598.

Certain other peculiarities should also be noted. In New Hampshire, Massachusetts, Rhode Island, and New Jersey there is a discontinuity between 1928 and 1929 data. For example, for New Hampshire the 1928 data cover the period from July 1, 1927, to June 30, 1928, whereas the 1929 data cover the calendar year, January 1 to December 31. Hence the table contains no record of sales in these states during the second half of 1928. A further difficulty appears in footnote ‖. Presumably it should read "Year ended October 31, except data for 1929," since footnote † on the 1929 data for New Jersey shows that they are census data and we know that the Census of Agriculture covered the calendar year. Finally, revised 1935 figures are to be found in the 1937 Statistical Abstract.

The detailed analysis of the notes accompanying this table indicates the method of evaluating data in terms of the background and surrounding circumstances. Footnotes and headnotes should always be studied carefully to discover what explanations the author of the table believed necessary to its comprehension. To disregard such notes is direct failure to use the available means of verification and evaluation of the data in the table.

Transcribing the Data

The final step in the collection process is to transfer the data from the source to a collection form. Although this appears to be a purely routine matter there are certain rules which, if observed, will help to avoid trouble later.

Always assemble all of the data before doing any copying. Too frequently it has been the authors' experience that students bring in part of a series of data and ask for advice on how to complete the series only to find that it cannot be completed and a new series must be found. Until all of the data have been found there is no way of knowing whether partial discoveries will be usable.
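The New Hampshire discontinuity described above, and the rule to assemble the whole series before copying, both amount to checking that the periods collected cover the required span without a break. A minimal Python sketch of such a check follows; the function name `find_gaps` and the tuple layout are illustrative assumptions rather than anything given in the text, and the dates are those quoted for New Hampshire in Table 27.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def find_gaps(periods):
    """Return the (first_day, last_day) spans not covered by any period.

    `periods` is a list of (first_day, last_day) tuples, both inclusive.
    Only gaps lying between the earliest start and the latest end are
    reported. (Hypothetical helper, written for illustration only.)
    """
    ordered = sorted(periods)
    gaps = []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        # A gap exists when the next period starts later than the day
        # after the last day covered so far.
        if start > covered_until + ONE_DAY:
            gaps.append((covered_until + ONE_DAY, start - ONE_DAY))
        covered_until = max(covered_until, end)
    return gaps


# New Hampshire in Table 27: the "1928" figure is a year ended June 30,
# while the "1929" figure is census data for the calendar year.
new_hampshire = [
    (date(1927, 7, 1), date(1928, 6, 30)),
    (date(1929, 1, 1), date(1929, 12, 31)),
]

gaps = find_gaps(new_hampshire)
for first_day, last_day in gaps:
    print("no record from", first_day, "to", last_day)
```

Run on the New Hampshire periods, the check reports one gap, July 1 to December 31, 1928, which is the half-year the text identifies as missing from the table.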
In transcribing data always start with the publication of most recent date and work back to the earlier dates. When revisions have been made from time to time, the best way to discover them is by comparing data in the latest publication with overlapping data of an earlier publication. For example, if it is necessary to obtain data from 1929 to 1940, inclusive, and data from 1933 to 1940 are found in one issue of a publication, then the latest issue containing data from 1929 to 1933 should be used to complete the series. Data for 1933 appear in both issues and should be compared to insure that no change has occurred in the recording and that the same series is being taken from both issues.

In another case the data may not agree in the two issues. Then three possibilities arise: (1) Explanations accompanying the tables may state the nature of the revision involved and how to make the series comparable in the two issues. (2) No explanation of the change may be given and it will be necessary to find another source containing the same series in comparable form or a substitute series that will serve the purpose. (3) Failing in both of the preceding expedients, the search may have to be abandoned.

Difficulties in matching series in different issues of a source book occur most frequently as the result of shifting the base of an index. Such a revision can usually be adjusted unless an accompanying change in the method of constructing the index entirely destroys the comparability of the two parts of the series.

APPENDIX B
EXAMPLES OF SEARCH FOR DATA IN LIBRARIES

These examples are intended to show how a student[18] proceeded to find six series of data for problems which were assigned to him.

Example I: Find the monthly freight car loadings by commodity classes for the ten-year period 1927-36.

1. Thought that material would be in the Statistical Abstract of the United States, but found that although there were data for freight car loadings, the data were not in monthly form. Made a mental note of this, remembering that most series of figures included in the Abstract were indicated by years.
2. Looked in the index to the Survey of Current Business and found the necessary material by individual classes and also by months for a particular period. By searching through the back issues or through the Supplements, I found the material available for the full ten years.
3. Tried the Federal Reserve Bulletin also and found that the same material was included in their monthly issues under the topic of industrial activity.

Example II: Find the monthly indexes of employment and payrolls in the United States for the five years from 1932 to 1936, under this particular heading "Retail Trade General Merchandising."

1. Discarded the thought of using the Statistical Abstract, because I wanted monthly figures.
2. Reached for the Survey of Current Business and found under the heading of employment and the heading of payrolls that monthly data were available for "Retail Trade," but that no distinction was made for "Retail Trade General Merchandising."
3. Because of the nature of the topic, I tried the Monthly Labor Review and by chance picked up a monthly issue dated in 1933. Here again the data were available but no distinction was made between General Merchandising and other merchandising. Searched further and found that the January, 1937, issue made this distinction in their current data. In looking through issues for 1936 and 1935, I found the entire series dating back to 1932 in the January, 1935, issue.

Example III: Find the Freight Tonnage Originating on Class I Steam Railways in the United States by quarters from 1927 to 1936.
Designate the following commodity groups separately: Products of Agriculture, Animals and Products, Products of Mines, Products of Forests, Manufactures and Miscellaneous, All L.C.L. Freight.

1. Looked in the Statistical Abstract of the United States and found data on freight tonnage, but the data were not in commodity groups nor by quarters. The figures showed tons of revenue freight carried.
2. Tried the Yearbook of Agriculture and found data entitled Freight Tonnage Originating on Railways in the United States and also the correct commodity groups, but the data were annual figures.
3. The Minerals Yearbook did not have the data in the correct form. And I didn't try the Survey of Current Business nor the Monthly Labor Review, nor the Federal Reserve Bulletin because of their particular use of monthly figures. Of course, this is not always true, but the particular nature of the sources led me to believe that the data were not available in them.
4. Because most of the previous data had been compiled by the Interstate Commerce Commission, I looked in the card catalogue for particular bulletins or statements published by the Commission or other independent establishments under the heading of "Commercial and Industrial," or publications of the Department of Commerce. A good index of government publications is the Monthly Catalogue of Public Documents. In the 1935 issue, I found under the topic of freight commodity statistics that the Bureau of Statistics of the Interstate Commerce Commission sent out statements quarterly giving Freight Statistics on Class I Steam Railways in the United States which included the total freight tonnage by the commodities indicated.

[18] Reports written in 1937 by Robert Berner, then a sophomore in the School of Business Administration of the University of Buffalo.

Example IV: Find the total market value of all listed stocks on the New York Stock Exchange by months from 1926 to 1936.

1. Tried the Federal Reserve Bulletin and found certain data about security markets giving the indexes of stock prices by months, but the data required were not found in this source.
2. Did not attempt looking into the Monthly Labor Review, Statistical Abstract, Minerals Yearbook or Agricultural Yearbook because of the nature of the subject and problem.
3. Found in the Survey of Current Business the stock prices of all stocks, sales, yields, and other information but not the total market values of all listed stocks. At this time, I thought that if I had the base figure and could compute the actual figures for sales and stock prices, that by multiplying the two figures I might have usable figures of market values. This procedure would not be too accurate, however.
4. Looked in the card catalogue for government publications other than the original six, but found nothing dealing with the subject except material which had been used in the Federal Reserve Bulletin and the Survey of Current Business.
5. Found in the card catalogue that the New York Stock Exchange published a monthly bulletin. Upon getting a copy I found the acceptable material.

Example V: Find the yearly production of steel rails from 1919 to 1936 by the following processes of steel manufacture: open-hearth, Bessemer, crucible, electric, and all others. Compare and explain the variations in each series.

1. Went immediately to the Statistical Abstract of the United States expecting to find the yearly figures. I did find figures giving the total rail production from 1914 to 1935, but they did not indicate the four distinct processes.
2. The Survey of Current Business contains monthly data on track work production, but does not contain the production according to processes.
3. Exhausted the content indexes of the Agricultural Yearbook, Minerals Yearbook, Federal Reserve Bulletin, and the Monthly Labor Review, but found nothing suitable to my needs.
4. Tried the other government sources, looking in the card catalogues and Index to Government Publications but found no discrimination in the processes used in producing steel rails. They did contain, at least a few sources contained, the total rail production figures by years, their source being the American Iron and Steel Institute.
5. Picked up a copy of the Annalist, finding only the tons of rails ordered by months as taken from the Railway Age magazine. This was not satisfying, nor were the data in the Commercial and Financial Chronicle.
6. Attempted to find the material in the technical magazine Steel. I found the data contained month by month quite inconsistent except for their index of business activity. Each month they include a new set of data recurring irregularly. Looking through several copies, I decided that the material I wanted was not included.
7. Tried the magazines Railway Age and Iron Age but found nothing satisfactory in either. The data in Iron Age are quite consistent, appearing in each monthly issue, showing the last two months, absolute figures. Their index of capital goods is one that is quite widely known and widely used. It shows weekly variations. Many of their figures on steel production and output were taken from the annual statistical report of the American Iron and Steel Institute.
8. I tried this annual report, finding an abundance of data on steel production by processes, one set of which contained the data I wanted.

Example VI: Find the dollar value of department-store sales, annually from 1927 to 1936 for the United States.

1. Glanced through the Agricultural Yearbook, Minerals Yearbook, and the Monthly Labor Review, and as I expected, found nothing pertaining to the required data.
2. Realizing that the Survey of Current Business and Federal Reserve Bulletin usually contain monthly data, I nevertheless found the data on department-store sales by months in index form.
By so doing, I hoped to find the original source of the data. The data, appearing in forms adjusted for seasonal and unadjusted, were compiled by the Board of Governors of the Federal Reserve System, Division of Research and Statistics. These figures represent monthly dollar sales for a sample of approximately 425 stores.
3. Tried the Statistical Abstract of the United States, finding the dollar value of department store sales (including mail-order sales) for 1929 and 1933. From the various distinctions given to retail stores, I suspect there is a difficult problem in determining what kind of store may be classified as a department store. The Abstract also has an index of yearly department-store sales from 1919 to 1935 which is also a sample of from 400 to 560 stores compiled by the Federal Reserve Board. The actual dollar sales are available for 1929 and 1933 because of the census reports made by the Census Bureau. The original source of these figures comes from the Fifteenth Census of United States: Distribution.
4. From the Census of Business, I found the dollar value of net sales for 1933 and 1935. This source divided department-store sales into independents, chains, mail-order, commission or company stores, and all others.
5. Found nothing in the Federal Document Index nor card catalogue that would lead me to the desired data.
6. The Industrial Arts Index in the Buffalo Public Library showed several sources of statistical data pertaining to department-store sales, some of which are included in the foregoing steps. Mr. C. M. Schmalz of the Harvard University Graduate School of Business Administration published a report containing the "Operating Results of Department and Specialty Stores in 1935." Yet these data were not wide enough in scope, either in number of establishments or variety of years.
7. The Commercial and Financial Chronicle contains monthly statistics of department-store sales in index form, taken again from the Federal Reserve Board.
8. Looked through various technical sources and card catalogues indicating subjects of technical magazines and books, but found nothing about dollar value of department-store sales, annually from 1927 to 1936.
9. Concluded that data were not available in published sources.

PROBLEMS

1. The answer to each of the following questions is to be found in a commonly used government source. (a) Give answers to the questions (as assigned) with exact reference to the source. (b) Describe the steps you followed in each case in order to locate the data.
(1) The percentage of increase in population for the United States and for California, from 1920 to 1930 and from 1930 to 1940.
(2) The total number of strikes in progress in the United States during the month of August of last year; the number of workers involved; and the number of man-days idle during the month.
(3) The wholesale price per bushel of No. 2 hard winter wheat at Kansas City, for the most recent week.
(4) The number of dozen pairs of women's full-fashioned silk hose exported from the United States during June of last year.

2. The answer to each of the following questions is to be found in a commonly used non-government source. (a) Give answers to the questions (as assigned) with exact reference to the source. (b) Describe the steps you followed in each case in order to locate the data.
(1) The number of new passenger car registrations for Ford, Chevrolet, Plymouth, and Cadillac in November of last year.
(2) The number of business failures in retail trade in the United States during the month of May each year since 1939.
(3) The percentage of American-made passenger cars sold outside the United States in 1935; motor trucks; total.
(4) The gross federal debt as reported by the United States Treasury as of 3 days ago.

3. State which of the steps of library search were employed in each of the examples in Appendix B.

4. Write a report of discrepancies found in the bank clearings reports in certain cross-referencing issues of the Commercial and Financial Chronicle. For example, the following issues will serve the purpose: January 18, 1941, February 22, 1941, etc., at monthly intervals.

5. A cursory search and attempted verification of the production of pig iron in the United States in 1937 produces the following figures:

| SOURCE | FIGURE |
|--------|--------|
| Steel, Yearbook of Industry, January 1, 1940 | 36,709,000 gross tons |
| Statistical Abstract, 1939 | 35,224,000 long tons |
| Survey of Current Business, 1938 Annual Supplement | 3,051,000 long tons (monthly average) |
| World Almanac, 1939 | 36,130,000 gross tons |
| Standard Trade & Securities, Statistical Bulletin | 100,300 gross tons (daily average) |
| Pentorfs Almanack 1940-41 | 41,114,000 net tons |

Which of these figures would you choose? Give reasons for your choice by consulting these several sources; explain as completely as possible the apparent discrepancies.

6. From any issue of the Commercial and Financial Chronicle select a series of data that are of the by-product variety. Explain the major purpose for which the data were collected. Evaluate the data.

7. Before using library data, what facts would you desire to know about
a) the nature of the data themselves? Why?
b) the types of units in which the data are expressed? Why?
c) the organization collecting or preparing them? Why?
d) the purpose for which they were issued? Why?
e) the consumers to whom they are addressed? Why?
f) the accuracy of the data? Why?
g) the homogeneity of the conditions under which the data were collected or to which they refer? Why?

8. Certain difficulties of collection occur in each of the following problems. Find as much information as you can in answering the question and explain the circumstances in the sources that make it difficult to secure complete and comparable data. (The instructor will assign one or more of the problems to each student, according to the time available.)
a) An important measure of steel ingot production is "per cent of capacity." Trace the changes in "capacity" since 1889.
b) Compare the number of savings banks, depositors, and amount of savings in your own state with the United States as of recent date.
c) Compare the changes in the number of employees in the carriage industry and in the automobile industry at 10-year intervals beginning with 1900.
d) What was the payroll of the executive branch of the United States government annually, 1929 to date?
e) Compare the number of full-time employees in one-, two-, and three-store independent groceries in the United States with the number employed in chain grocery stores, in 1929 and in 1935.
f) Select the five industries whose indexes of employment were lowest during the most recent month, and compare these indexes with their indexes in 1929 and in 1932.

9. Could Table 21, page 168, and Table 28, page 233, be used for verification by cross reference? Give reason for your answer.

REFERENCES

LANE, MORTIMER B., How To Use Current Business Statistics. Washington, D. C.: Government Printing Office, 1928.
The entire pamphlet should be read to obtain background knowledge concerning published statistics. Chapter IV is particularly valuable.

SCHMECKEBIER, LAURENCE F., Government Publications and Their Use. Washington, D. C.: The Brookings Institution, 1939.
This book tells how to find government documents. It is useful to statisticians in locating less familiar publications.

CHAPTER XI
RATIOS

THE IMPORTANCE OF RATIOS IN STATISTICS

AMONG all statistical techniques none is so commonly used as the ratio. For instance we speak of having a national debt of $323 per capita; banks paying 2 per cent interest on savings deposits; a retail merchant making a gross profit of 25 per cent on the cost of goods; sales 10 per cent above those of last year; a death rate of 11.0. Such ratios serve the twofold purpose of (1) simplifying data and (2) increasing their comparability.
Ratios properly presented are so easily understood that an analysis of methods seems almost unnecessary. However, when the student changes from the role of a reader to that of a statistician who must transform primary data into ratio form he finds himself confronted with some problems. There are certain principles that determine the construction, presentation, and interpretation of statistical ratios. An exposition of these will form the content of this chapter and the next.

CONSTRUCTION OF STATISTICAL RATIOS

Statistical ratios are fundamentally the same as the ratios with which everyone becomes familiar in studying arithmetic. In chapter II on "The Use of Numbers," no particular mention was made of ratios, since from the point of view of arithmetic computations they are the same as any other fractions and are handled as such. However, since statistical ratios deal always with concrete values or quantities rather than with abstract numbers, certain modifications of the arithmetic concept of the ratio should be noted at the outset. These include the form of expression, the importance of the item used as the base, the number of units in which the base is expressed and the possibility of relations between unlike as well as like items.

Form of Expression

A ratio in arithmetic is the relation which one number or quantity has to another, its value being expressed as the abstract quotient of the first divided by the second. The term "ratio" is applied either to the original fraction, to the quotient or to both stated together. In statistics the ratio relationship is always between two precisely defined concrete quantities or values. This relationship is simplified through the division of the first term by the second, but it is never expressed as an abstract quotient. Instead the value of a statistical ratio is the simplified value of the numerator expressed in relation to one or more units of the denominator.
For example, the items used in the first ratio quoted were the national debt of the United States of $42,558,875,571, March 31, 1940, and the population as of April 1, 1940, 131,699,275 persons. Dividing the first number by the second gives a quotient of 323, but this is not the statistical ratio. It must rather be stated as follows: "In 1940 the national debt of the United States was $323 per capita" or "$323 for every person in the United States." The qualifying descriptions of both numerator and denominator items must be either fully stated or clearly understood. When statistical ratios are listed in a table the exact specifications of both numerator and denominator are indicated in the table headings.

Selection of Base

In forming ratios between two abstract numbers either one may be used as the denominator or base, e.g., $5 \div 20 = \tfrac{1}{4}$ or $20 \div 5 = 4$. Likewise an abstract quotient can be readily understood either in the form .0126, 1.26, or 126; hence there is no occasion for changing the number of units used in the base. However, in every statistical ratio the numbers represent definite concrete items and consequently two questions must always be considered: (1) Which item is the logical base? (2) In what number of units shall it be expressed? These two points must be discussed separately.

The Item. The denominator of a statistical ratio is always a standard to which the numerator is being compared. The two numbers each refer to concrete values or quantities whose characteristics require that one of them should be used as the standard in terms of which the other is to be measured. In some types of ratio construction it is immediately obvious which of the two items is the appropriate base:

a) In a comparison between a part and the whole, the whole is always the base.
b) In time comparisons between a recent and a prior recording of like items, the prior event will almost always be taken as the base.
c) In a comparison between an effect and its cause or between two values or events one of which is at least partly dependent upon the other, the cause or the independent item is always the base.

In certain other types of ratios the choice of item for the base depends upon the use that is to be made of the ratio:

a) In comparisons between like totals or between two parts of the same total, either one may be selected as the base according to the emphasis desired.
b) In various accounting ratios such as sales divided by inventory, custom has determined the form that is used.

The Number of Units. The number of denominator units used as the base may be determined by custom, convenience, or effectiveness. Referring again to some of the first examples quoted in this chapter, the national debt is expressed in terms of one denominator unit, so many dollars for each single individual in the population; an interest rate of 2 per cent means two dollars for every hundred dollars deposited; the death rate indicates the number of deaths during a given period for every thousand persons alive at the beginning of the period. These examples illustrate the practice of expressing the numerator of a statistical ratio in terms of: (a) one unit of the base, (b) 100 units of the base, (c) other powers of 10 units of the base.

One denominator unit as the base: There are many examples in which the base of a ratio is expressed as a single unit. All per capita ratios use one person as the unit of the base. In agriculture we use production per acre; in railroading revenue per ton-mile and per passenger-mile. The accountant uses a 2 to 1 ratio between current assets and current liabilities as a standard of liquidity, and such examples might be listed indefinitely.
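The three conventions listed under "The Number of Units" (one unit, 100 units, or another power of ten units of the base) differ only in the multiplier applied to the quotient. The following Python sketch makes that explicit. The helper name `ratio_per` is hypothetical, and the death counts are invented solely to reproduce the rate of 11.0 mentioned at the opening of the chapter; the national debt and population figures are those quoted in the text.

```python
def ratio_per(numerator, denominator, base_units=1):
    """Numerator units for every `base_units` units of the denominator.

    Hypothetical helper for illustration: base_units=1 gives a per-unit
    (e.g. per capita) ratio, 100 a per cent, 1_000 a rate per thousand.
    """
    if denominator == 0:
        raise ValueError("the base (denominator) must not be zero")
    return numerator * base_units / denominator


# (a) One unit of the base: national debt per capita, 1940.
debt_per_capita = ratio_per(42_558_875_571, 131_699_275)

# (b) 100 units of the base: two dollars of interest on a hundred
#     dollars deposited is a rate of 2 per cent.
interest_per_cent = ratio_per(2, 100, base_units=100)

# (c) 1,000 units of the base: assumed counts (not from the text) that
#     yield a death rate of 11.0 per thousand.
death_rate = ratio_per(11_000, 1_000_000, base_units=1_000)

print(f"${debt_per_capita:.0f} per capita")
print(f"{interest_per_cent:.0f} per cent")
print(f"{death_rate:.1f} per 1,000")
```

Keeping the base units as an explicit argument, rather than writing three separate formulas, mirrors the point of the passage: the division is the same in every case, and only the number of denominator units in which the result is stated changes.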
The expression of the numerator of a ratio in terms of one unit of the denominator as the base is accomplished by the application of a simple proportion in which x = the desired value for the numerator,

$$\text{numerator} : \text{denominator} = x : 1$$

The solution for x requires simply that the numerator be divided by the denominator. The result then becomes a simplified value for the numerator in terms of one denominator unit, similar to that determined for the national debt per capita.

One hundred denominator units as the base: Most of the comparisons made by the lay user of statistics are in terms of per cents. Thus we have a 5 per cent increase in grocery prices, a 3 per cent increase in bank deposits, the grades of a class of students 2 per cent below the average, a selling price which is 130 per cent of the cost, a wheat crop which is 85 per cent as great as last year, humidity of 75 per cent and so on. In each case the number stated as a per cent indicates how many numerator units there are for every hundred denominator units.[1]

An illustration of the method of expressing a ratio in terms of 100 units as a base may be taken from Table 21, page 168. Column 1 gives the number of telephones in use during each year from 1931 to 1936. To find the ratio of telephones in use in 1936 to those in 1931 the formula that was used before could be applied but the result would then be $\tfrac{14{,}454}{15{,}390}$ or .94 telephones in 1936 for every one in 1931. Since it is difficult to visualize .94 of a telephone, a base of 100 units instead of one should be chosen.

$$14{,}454 : 15{,}390 = x : 100$$

The numerator was divided by the denominator as before but the decimal point was moved two places to the right. The result may be stated: "There were 94 telephones in use in 1936 for every 100 in 1931" or "The number of telephones in 1936 was 94 per cent of the number in use in 1931."
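The proportion for the telephone figures can be solved explicitly. The only step beyond the one-unit case is multiplication by the 100 units of the base, which is what moving the decimal point two places accomplishes. A worked version, using only the two Table 21 figures quoted in the text:

```latex
\[
14{,}454 : 15{,}390 = x : 100
\quad\Longrightarrow\quad
x = \frac{14{,}454 \times 100}{15{,}390} \approx 93.9 \approx 94
\]
```

Rounded to the nearest whole telephone, this gives the 94 per 100 stated above.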
Other powers of ten denominator units as a base: Ten, 1,000, 10,000, 100,000, or even larger numbers of units may be used in the base. An advertiser may state that four out of every ten refrigerators sold last month were "Evercolds," the intention being to express the preference for the Evercold product even more vividly than would be the case if the advertisement stated that 40 per cent were "Evercolds." The use of telephones is expressed in the form, "number of telephones per thousand population." In a published chart dealing with automobile fatalities the following ratios were presented: deaths per 10,000 cars registered, deaths per 100,000 population, deaths per 10,000,000 gallons of gasoline consumed. Fish hatcheries study the propagation of fish in units of 1,000,000 fingerlings planted. The hazards of different methods of transportation are expressed by comparing the number of deaths to the number of miles traveled in units of 100,000,000 miles.

[1] The construction and use of per cents may be reviewed by referring to chapter II.

Similar usage appears in the field of vital statistics. The death rate for the United States in 1933 was 10.7 per 1,000 persons living in the entire area and the birth rate was 16.5 per 1,000 persons living in the area. When dealing with specific causes of death the base used is 100,000. Thus the 1933 death rate from cancer was 102.2 per 100,000 population in the United States, while the suicide rate was 15.9 per 100,000 population.

There are two rules that determine whether one or some higher power of ten units should be used as the base:

1. The number used as the base should be large enough so that the value of the numerator will appear mainly as a whole number but will have not more than three digits to the left of the decimal point. In Table 28 the figures in column 4 are unwieldy.
The rule for significant figures permits carrying these quotients to four or even five digits but one of the advantages of the use of ratios, simplicity, has been lost. The most effective form for these ratios is shown in column 3 in which the results appear as whole numbers of only three digits each.

TABLE 28
ESTIMATED NUMBER OF TELEPHONES IN USE IN THE UNITED STATES, ESTIMATED POPULATION, AND RATIOS OF THE TWO AT FIVE-YEAR INTERVALS, 1920-35*

| YEAR | (1) ESTIMATED NUMBER OF TELEPHONES (000 omitted) | (2) ESTIMATED POPULATION (000 omitted) | (3) TELEPHONES PER 1,000 POPULATION | (4) TELEPHONES PER 10,000 POPULATION |
|------|--------|---------|-----|-------|
| 1920 | 13,329 | 106,543 | 125 | 1,251 |
| 1925 | 16,936 | 114,867 | 147 | 1,474 |
| 1930 | 20,201 | 123,091 | 164 | 1,641 |
| 1935 | 17,503 | 127,521 | 137 | 1,373 |

* Statistical Abstract, 1936: Telephones, p. 344; Population, p. 10.

2. The number used as the base should be smaller than the number in the original denominator; otherwise the ratio implies more stability than is warranted. That is, a per cent should not be based on fewer than 100 cases. A ratio expressed as so many per thousand should in like manner rest on at least 1,000 cases. The per cents in Table 29 should not be used because in each month the base is less than 100. For instance, in June each failure accounts for 14 per cent; hence one less or one more failure in 1937 would have caused the ratio either to be doubled or reduced to zero.

TABLE 29
NUMBER OF BUSINESS FAILURES IN BUFFALO, NEW YORK, FIRST SIX MONTHS OF 1936 AND 1937

| MONTH | (1) NUMBER OF FAILURES, 1936 | (2) NUMBER OF FAILURES, 1937 | (3) PERCENTAGE OF CHANGE IN FAILURES, (2) ÷ (1) − 100% |
|----------|----|----|------|
| January | 12 | 2 | − 83 |
| February | 14 | 5 | − 64 |
| March | 3 | 7 | +133 |
| April | 8 | 10 | + 25 |
| May | 4 | 11 | +175 |
| June | 7 | 6 | − 14 |

The same criticism applies to the data in Table 30. A 100 per cent distribution has been computed from 22 cases.
The column of per cents immediately conveys the impression that at least 100 accidents were involved whereas it really means that if 100 accidents had occurred about 36 of them would have been caused by mine cars, the estimate being accurate only to the extent that a prediction can be based on an experience of 8 cases out of 22. The computation of per cents to two decimal places in this table is further cause for criticism. It is spurious accuracy because the transfer of one accident to a different class (the minimum change possible in the table) would result in a change of 4.5 points in the affected ratios. For example, if there had been 9 fatal accidents due to mine cars and 9 in the miscellaneous class, then the per cent in each of these two classes would be changed to 40.91 per cent. Obviously there is no reason for carrying per cents to even one decimal place when they are based on so few cases.

TABLE 30
FATAL ACCIDENTS AMONG OUTSIDE WORKERS AT BITUMINOUS COAL MINES IN PENNSYLVANIA, CLASSIFIED BY CAUSE, 1924*

CAUSE OF ACCIDENTS    NUMBER OF ACCIDENTS    PERCENTAGE DISTRIBUTION OF ACCIDENTS
Mine cars             8                      36.36
Railroad cars         1                      4.55
Electricity           3                      13.64
Miscellaneous         10                     45.45
Total                 22                     100.00

* Pennsylvania Departmental Statistics (Commonwealth of Pennsylvania, Department of State and Finance, Harrisburg, Pennsylvania, 1925), p. 139.

Kinds of Ratios

The basic definition of an arithmetic ratio includes the qualification that the two quantities must be of the same kind and expressed in the same unit of measure. It follows that whenever data are homogeneous they provide suitable material for ratio comparisons. However, before proceeding with the discussion of ratios between items of the same kind, an explanation is necessary concerning statistical ratios that are made up of unlike items.

Ratios between Unlike Items.
The possibility of such ratios in statistics was indicated in the first section of chapter VIII in the statement that one type of table may contain "several sets of information .... not expressed in the same unit, but they .... bear some relation to each other." Arithmetic textbooks say, "We cannot express the ratio of a horse to a sheep," and "No ratio exists between five tons and 30 days." Yet even a brief experience in statistics shows that it is exactly such pairs of unlike items, usually expressed in two different units, that do provide the material for many statistical ratios in common use. Examples are: the rate of production per day or per acre, the income per capita, freight revenue per mile of railway, or bad debt losses per dollar of sales.

Such ratios are permissible in statistics because, as previously noted, the statistical ratio is not an abstract quotient. Dollars of revenue are not actually divided by miles, nor bushels by acres. The statistical ratio is merely a simplified statement of a factual relationship that does exist in each case between numerator and denominator items. For example, the total number of bushels of wheat that is produced depends upon the total number of acres under production; hence it is justifiable to divide the first number by the second in order to arrive at a simpler figure which will indicate the average number of bushels produced per acre cultivated.

Careful scrutiny is necessary in many cases in order to ascertain whether the items that are being compared are really like or unlike. This is particularly true of items measured in dollar value. The dollar appears to be the same unit whether it represents dollars of credit or dollars of sales; hence dollar values are readily combined in ratios.
In other instances, a word such as "persons," "products," etc., may be used in both terms of a ratio, but unless the word is defined identically in the two terms the ratio is between unlike items and is subject to definite limitations.

In the construction and use of ratios between unlike items, whether expressed in the same or different units, there are three points of caution to be observed.

1. The numerator and denominator items although representing different objects or values must be identically defined in both time and space. Table 21, page 168, contains three different sets of data: telephones, messages, and revenue, each of the latter two being subdivided according to attribute. All three have the same time classification in the stub and since there is no space classification the spatial characteristic is identical for every item in the table, that is, each represents territory served by the Bell Telephone System. In every horizontal row, therefore, the time and space characteristics are identical for all three sets of items and ratio relationships may be looked for between them. Thus in 1931 the ratio of the number of local messages to the number of telephones was 1,475 messages per telephone; the total revenue per telephone in the same year was $68; the average payment per toll call in the same year was $.33. Obviously one would not compare the number of telephones in 1931 to the revenue for a different year nor, if state data were available, would the number of messages in New York State be compared to the total number of telephones in New Jersey. A complete table such as this one is not always available when single ratios are being used, but it is always possible to reconstruct the table headings in outline in order to test whether the unlike items being used in any given ratio relationships do conform to this rule of identical time and space for numerator and denominator.

2.
There must be a very definite relationship, causal or otherwise, between numerator and denominator.

In each of the three ratios between unlike items quoted in the preceding paragraph the numerator is in some degree dependent upon the denominator item. The messages are dependent upon the telephones because the telephones must be used in transmitting the messages; operating revenue comes into existence only if telephone instruments are in use; and the revenue from toll calls arises from the fact that the toll calls have been made.

It is easy to assume ratio relationships in such cases as these without giving enough care to the definitions of the items used. The use of general terms may be correct in some ratios, whereas in other cases that appear to be similar a more specific term is needed to bring out the desired relationship. For example, the ratio "population per automobile registered" might be used in measuring traffic density, but if the standard of living is being measured the ratio should be "registered passenger automobiles to population."2 These two ratios also illustrate the point that the purpose for which a ratio is being used will determine which of the two items is dependent upon the other. The fact that a certain item such as population is used as the base in the majority of the ratios in which it occurs does not prove that it will be the base item in every case.

3. The relation between numerator and denominator must be correctly expressed.

A full and accurately worded statement of the relationship is important in any type of ratio but especially so when the numerator and denominator represent different things. In particular, one is never a "per cent of" the other since in the case of unlike items one could not be any number of 100ths of the other. If the base is conveniently expressed in 100 units, the use of "per cent" is permissible, provided it is combined with "number," "value," or some corresponding expression.
For example, "The number of teachers is 20 per cent of the number of students," or "There are 20 per cent as many teachers as students," but certainly not "The teachers are 20 per cent of the students." An experiment to determine the effect of fertilizer upon wheat yield showed that the yield increased four bushels per acre when 100 bushels of lime were spread per acre. This statement involves two ratios between unlike items. Clearly it would be incorrect to say there was a 4 per cent increase either of the lime or of the wheat.

2 See chapter XII for further discussion regarding refinement of definition in the construction of such ratios.

Ratios between Like Items. Statistical data are considered "like" if they are expressed in the same unit and differ with respect to only one characteristic, according to the classifications that were used in the original tabulation. They may be alike in all attributes and in time, differing only in space; they may be identical in attributes and in space but different in time; or they may be alike in both time and space and in all but one attribute. This last group can be distinguished from the "unlike" data discussed in the preceding section which are also identical in time and space. Unlike items may even be expressed in the same unit of measure but they are differentiated from one another by separate definitions, which show that the items are in different categories.

Referring again to the table of the Bell Telephone System, columns 4, 5, and 6 may be considered separately as a two-way table of like items. Total operating income in the entire system is subclassified according to attribute, local and toll, and is cross-classified according to time. Thus in any one row, the data in columns 4, 5, and 6 are alike in time and space and differ only in the one attribute according to which operating income has been subdivided. They are, therefore, "like" and may be compared.
We may say, for example, that in 1931 the revenue from local service was 68.9 per cent of the total operating income. Similarly any figure in columns 4, 5, or 6 may be compared with another figure in the same column. They are alike in space and in attribute, differing only in time. Thus the toll revenue decreased from $326,300,000 in 1931 to $243,900,000 in 1933, a decrease of $82,400,000 or 25.3 per cent.

Items that are listed under a single heading in a table are often potentially subject to further subdivision. For example, a single set of data headed "United States" might be subdivided according to the main geographic divisions, according to Federal Reserve Districts or according to the 48 states. "Total wage earners" might be subdivided into male and female. They might also be subdivided into age groups or by wage rates.

The danger of comparisons between items that are too general in definition has already been noted in the case of ratios between unlike items, and the warning is equally applicable to ratios between like items. When subclassifications or refinements in definition are available, the maker of ratios should proceed with care before he looks for relations between general data that appear to be "like." However, his refinements can go no farther than the available data will permit. If, according to the classification used, the items are like in every characteristic but one, then they may be combined in ratios. But the possibility that the relations between such data might be affected by further subdivision of their characteristics must be kept constantly in mind in drawing conclusions from these ratios. This point becomes of special importance when comparisons are drawn between two or more ratios and will be discussed further in a later section.
In conformity with the fundamental classifications of data, a ratio between like items may be classified as a time, space, or attribute ratio according to the one respect in which the numerator item differs from the denominator. A second method of classifying a ratio is according to whether (1) the numerator item is a part of the denominator item; (2) the numerator and denominator are separate parts of the same total; (3) the numerator and denominator items are separate totals. The mechanics of ratio construction will be discussed according to these part-total relationships and in each case any differences in the treatment of time, space, and attribute ratios will be pointed out.

Part-to-total: This type of ratio is used chiefly in space and attribute comparisons. Items that differ in time also become material for part-to-total ratios when they are of such a nature that they make up a cumulative total, as for example monthly production figures for a given year. The method of construction and use of part-to-total ratios is identical for all three types of comparison. Part-to-total ratios take two common forms: (1) the comparison of a single part to the whole and (2) a percentage distribution in which all the parts are shown as percentages of the whole or 100 per cent.

Single Ratios: Examples of single part-to-total ratios are the percentage of manufactured products in the state of Michigan produced in the Detroit area; the number of factory workers over 65 years of age per 1,000 factory workers; and the number of high-school graduates entering college per 1,000 high-school graduates. In each of these examples the part selected for comparison with the total is chosen to demonstrate a particular point and so far as that demonstration is concerned nothing need be known about the other parts of the total except that they exist. The first example was a spatial ratio and the others were attribute ratios.
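A single part-to-total ratio of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ratio_per_base` is hypothetical, and the figures are those quoted for insurance companies and for all lenders in Table 31.

```python
def ratio_per_base(part, total, base=100):
    """Express `part` as so many per `base` units of `total`.

    `ratio_per_base` is a hypothetical helper name used for illustration.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if part < 0 or part > total:
        raise ValueError("a part must lie between zero and its total")
    return part / total * base


# Figures from Table 31 (millions of dollars): insurance companies, all lenders.
insurance, all_lenders = 127.1, 1437.3

per_cent = ratio_per_base(insurance, all_lenders)            # per 100
per_thousand = ratio_per_base(insurance, all_lenders, 1000)  # per 1,000

print(round(per_cent, 1))      # 8.8
print(round(per_thousand, 1))  # 88.4
```

The base is left as a parameter because the choice between 100, 1,000, or a higher power of ten is a matter of presentation; the underlying quotient is the same.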
When part-to-total time ratios are to be constructed, a distinction must be made between series in which there is no overlapping between the separate items and those in which the quantities or values do overlap. In the first type the separate parts can be added to make up a total for a longer period of time, as for example, the sum of the exports for each of 12 months in a year will equal the total year's exports and it follows that the data may be used in part-to-total ratios. Series like this are quite different in nature from those which are recorded at similar time intervals but which represent overlapping quantities or values. Such time series as number of employees, number of acres under production, population or assessed value of property cannot be added to form totals. Consequently no part-to-total ratios can be constructed from them.

The telephone table (Table 21, page 168) may again be used to illustrate the contrast in these two kinds of time series. Columns 2 and 3 show the number of messages of a certain kind that were transmitted during each year from 1931 to 1936. Here there is no overlapping; every message counted in 1931 is distinct from those counted in each of the other years. If there were any special significance in the six-year period, the number of messages of each kind could be totaled and the ratio of any one year to the total period could be used. Columns 4, 5, and 6 which show the operating income for each year likewise consist of non-overlapping items and could be treated in the same way. However, the items in column 1, the number of telephones in use during each year, cannot be added to give a total. They are obviously overlapping data, since most of the 15,390,000 instruments in use in 1931 are also counted among the 13,793,000 used in 1932. Some new ones have been added while some of the old ones have been disconnected and like changes have occurred every year.
Since the separate items do not constitute a total, no part-to-total ratios can be made from them. The only possible ratios would be those between two single figures in the same column, that is, total-to-total ratios. Time ratios of this kind will be discussed in a later section.

Percentage Distributions: The same types of data that are suitable for single part-to-total comparisons can be presented as percentage distributions. This is a ratio technique that gives emphasis to the relative importance of each of the parts that make up a total. The several numerator items are each expressed in terms of 100 units of the same denominator, the denominator being equal to the sum of the numerator items.

Table 31 shows the amounts loaned on non-farm mortgages by different types of lending institutions during five months of 1939 with a percentage distribution of the several items. The per cent column shows more clearly than the original data that savings-and-loan associations were the most important lending agencies during this period, and that banks and trust companies were second. The least business was done by insurance companies and mutual savings banks with 8.8 per cent and 3.2 per cent, respectively, or only 12 per cent for the two combined.

TABLE 31
NON-FARM MORTGAGE RECORDINGS IN THE UNITED STATES BY TYPE OF MORTGAGEE, FIRST FIVE MONTHS OF 1939*

                                 VALUE OF MORTGAGE     PER CENT OF
TYPE OF LENDER                   RECORDINGS            TOTAL RECORDINGS
                                 (000,000 omitted)
Savings-and-loan associations    $431.8                30.1
Insurance companies              127.1                 8.8
Banks and trust companies        359.2                 25.0
Mutual savings banks             46.7                  3.2
Individuals                      263.7                 18.4
Others                           208.8                 14.5
Total                            $1,437.3              100.0

* Federal Home Loan Bank Review, Vol. 5, No. 10 (July, 1939), p. 311. Federal Home Loan Bank Board, Washington, D. C.

Table 32 presents a part-to-whole analysis which resembles the preceding one in form. At first glance it might appear to be another percentage distribution, in this case of items differing in space.
The percentage living on farms varies from 2.4 in Rhode Island to 31.4 in Vermont and a total of these per cents happens to be about 100, which might be assumed to represent the total for the New England and Middle Atlantic States. However, a closer inspection shows that this is not a percentage distribution but a series made up of the first type of part-to-whole ratios. The separate ratios are not comparisons in space but of attribute, i.e., residence on farms is an attribute of a part of the population of each state. Each of the ratios has been computed from a different base, the total population of that state; therefore they cannot be added to give a total that has any meaning. The percentage of the total population living on farms in all of the states together must be computed from the total original data, the same as was done for each separate state.

TABLE 32
PER CENT OF TOTAL POPULATION LIVING ON FARMS IN NORTHEASTERN STATES, 1930 CENSUS*

STATE            PER CENT LIVING ON FARMS
Maine            21.4
New Hampshire    13.5
Vermont          31.4
Massachusetts    2.9
Rhode Island     2.4
Connecticut      5.4
New York         5.7
New Jersey       3.2
Pennsylvania     8.9

* Statistical Abstract, 1936, p. .

Errors To Avoid in Percentage Distributions: Sometimes per cents of a total are quoted that do not amount to 100 per cent, usually due to some kind of carelessness. This error can be avoided if all the per cents are quoted in tabular form, including the 100 per cent total.3 Other examples, such as Table 33, may be found in which the total of a percentage distribution greatly exceeds 100, not because of error in the computations but because the table contains two distributions instead of one. In this case further confusion is added because in each of the two the data have been distributed in a double classification. Thus there is no clear distribution according to each characteristic separately.
In the source from which Table 33 was taken the percentages and primary data were all given in a single column which was even more confusing than in the form shown, since there was no indication which items together totaled 100 per cent. A more usable form for the data is shown in Table 34 which is practically equivalent to two separate tables. In this form comparisons are immediately apparent between the percentages normal and defective in the various categories.

TABLE 33
CLASSIFICATION OF DEFECTS BY SEX AND NATIVITY, FOURTH-CLASS SCHOOL DISTRICTS, PENNSYLVANIA, 1917-18*

                  NUMBER     PER CENT
Total male        240,553
  Normal          55,735     11.5
  Defective       184,818    38.1
Total female      244,455
  Normal          63,858     13.2
  Defective       180,597    37.2
Total native      464,034
  Normal          115,671    23.9
  Defective       348,363    71.8
Total foreign     20,974
  Normal          3,922      0.8
  Defective       17,052     3.5

* Departmental Statistics (Commonwealth of Pennsylvania, Department of State and Finance, Harrisburg, Pa., 1925), p. 72.

This same type of error may appear in a number of different forms, in all of which the mistake lies in the attempt to show too much in one distribution. Percentages of subtotals should not appear in the same column with percentages of the total distribution unless italicized or otherwise unmistakably distinguished. It is preferable to make several short tables, each showing one set of relationships clearly. Other isolated percentage relationships that do not warrant the construction of a special table may be pointed out in the text and in any such case the original data should be quoted along with the per cent.

3 For a discussion of significant figures in percentage distributions refer to chapter VIII, pages 168-69.
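The remedy described above, one distribution per total, can be sketched in Python as follows. The function name `percentage_distribution` and the dictionary layout are illustrative assumptions; the counts are the primary data of Table 33. Each group is distributed on its own total, so every distribution adds to 100 per cent.

```python
def percentage_distribution(parts):
    """Return each part as a per cent of the sum of all parts, to one decimal."""
    total = sum(parts.values())
    return {name: round(value / total * 100, 1) for name, value in parts.items()}


# Pupil counts from Table 33, grouped so that each group is one complete total.
groups = {
    "male": {"normal": 55_735, "defective": 184_818},
    "female": {"normal": 63_858, "defective": 180_597},
    "native": {"normal": 115_671, "defective": 348_363},
    "foreign-born": {"normal": 3_922, "defective": 17_052},
}

for name, counts in groups.items():
    # Quote the original total alongside the per cents, as the text advises.
    print(name, sum(counts.values()), percentage_distribution(counts))
```

For the male group this gives 23.2 per cent normal and 76.8 per cent defective, in place of the 11.5 and 38.1 of Table 33, which were computed on the combined total of both sexes.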
TABLE 34
PUPILS IN FOURTH-CLASS SCHOOL DISTRICTS IN PENNSYLVANIA: NUMBER AND PER CENT NORMAL AND DEFECTIVE ACCORDING TO SEX AND NATIVITY, 1917-18

             SEX                             NATIVITY
             Male      Female    Total       Native    Foreign-Born   Total
NUMBER
Normal       55,735    63,858    119,593     115,671   3,922          119,593
Defective    184,818   180,597   365,415     348,363   17,052         365,415
Total        240,553   244,455   485,008     464,034   20,974         485,008
PER CENT
Normal       23.2      26.1      24.7        24.9      18.7           24.7
Defective    76.8      73.9      75.3        75.1      81.3           75.3
Total        100.0     100.0     100.0       100.0     100.0          100.0

One of the most frequent misuses of a percentage distribution results from the inclusion of a miscellaneous class. Such a class may contain (1) items which are known to be distinct from those included in the separate classes or (2) items that are unknown or poorly defined.

1. If a class is designated as "All other," "Others," or "Not elsewhere classified," it indicates that a number of less important classes of the distribution have been combined in order to conserve space, to concentrate the reader's attention on the important items or to avoid disclosing confidential information. The characteristics of all these other items are known and they definitely do not belong in any of the specifically named classes. No single class included among "Others" should be larger than the smallest class that is named separately, although the total of the combined "Others" may be greater. In Table 31, "Others" presumably includes endowment funds, non-profit institutions, etc., each of which is distinct from and less important as a mortgage investor than the separately listed lenders of the table. Under such circumstances the information contained in the specific classes loses none of its accuracy by reason of the inclusion of a miscellaneous class. A percentage distribution that includes the "Others" as one of the parts of the 100 per cent total will therefore correctly represent the relation of each part to the total.

2.
In tabulating primary data it frequently happens that the answers to certain questions are missing from some of the collected schedules. Faulty questionnaire planning may likewise result in a group of poorly defined answers that cannot be classified precisely. Such cases must be grouped in an "Unknown" or "Not reported" class, although at least some of them should have been included in one or more of the known classes. The calculation of a percentage distribution with this unknown group as a component part of the total would therefore distort the true relation of each of the specific groups to the total. An alternative method of dealing with this situation will depend upon the circumstances surrounding the collection of the data.

a) If it can be assumed that the known cases comprise a representative sample of the total, the unknown group even if relatively large may be dropped and a percentage distribution computed of the total known cases. This is justifiable in any case if the unknown group is relatively small, since the omission of a few items from one or more groups will not materially affect the percentage relationships. A footnote may be added stating the number of items omitted and what per cent they are of the total number investigated.

b) If a large unknown group has resulted from some element of bias in answering the questions, the distribution of known items cannot be assumed to be representative. In such cases no percentage distribution should be computed and indeed the original data are of questionable value.

Table 30 illustrates a so-called miscellaneous class that is really unknown. The source from which the table was taken gave no direct or collateral information to indicate whether the ten accidents classified as miscellaneous were attributable to causes other than those listed or whether several of them were not allocated because of insufficient information.
If the cases in the miscellaneous class are independent of the listed causes, then none of them should be more important than the listed causes. Since the table lists one accident involving railroad cars it would follow that ten different causes of one accident each are included in the miscellaneous class. While this situation is quite possible it seems more likely that these ten accidents have been grouped in a miscellaneous class because of insufficient information to allocate them. If this is the correct interpretation, then the entire table is worthless because the allocation of these ten cases to specific causes might change completely the distribution of cases in the three classes.

Part-to-part and total-to-total: Ratios between two like items neither one of which is a total including the other may be either part-to-part or total-to-total. From the point of view of ratio construction they may be considered together since there is no essential difference in method.

In the case of space ratios the difference is only in the point of view. The areas of Canada and the United States may be regarded as separate totals or they may be two of the component parts of the total area of North America. From either viewpoint the area of Canada to the United States is in the ratio of 106 : 100, or it is 106 per cent as great as the area of the United States.

In time series when the data are non-overlapping they may be regarded either as separate totals or as parts of a larger total; if they do overlap they are always separate totals. However, the method of comparing one item with another or with an average or other standard is the same in either case.

Attribute ratios may appear to be comparisons between separate totals but if they are made up of genuinely "like" items a broader definition can be found under which they will range themselves as two component parts of a larger total.
If the two items that are being compared can in no sense be regarded as mutually exclusive parts of a total, then they are not attribute ratios of like items, even though they appear to be expressed in the same unit. They are instead ratios between unlike items and are subject to the limitations already mentioned under that head. For example, the results of a study of radio advertising yielded the following sets of data: total number of persons interviewed; number who listened to a given radio program; number who bought the product advertised on the program. All three of these sets of data used the general unit "persons." This unit had been subdivided in two ways: listeners and non-listeners; buyers and non-buyers. Relationships between listeners and non-listeners, buyers and non-buyers or listener-buyers and listener-non-buyers were genuine part-to-part ratios. But "total listeners" and "total buyers" were not mutually exclusive categories under the general term "persons." Hence for the purpose at hand they were unlike items. A ratio between them would have been valid only if the number of one group were in some way dependent upon the number in the other group, an assumption that would have been difficult to justify.

Usefulness of Part-to-Part Ratios: Ratios between the several parts will frequently provide more exact information than the ratios of each part to the total. In the field of vital statistics such ratios as the number of male births to the number of female births, foreign-born to native, urban to rural, and white to colored population are in common use. In these cases the corresponding part appears to be a more natural standard than the total of the two. Furthermore, the use of a small base emphasizes the degree of difference between the two parts more effectively than if each were compared with the total. The part-to-part ratio is equally advantageous in the field of business.
Table 31 showed that out of every $100 of new mortgage loans $30 were made by savings-and-loan associations, and $25 by banks and trust companies. A part-to-part ratio would afford the more direct statement that only $83 was loaned by banks and trust companies for every $100 by savings-and-loan associations. Or a statistician employed in a mutual savings bank might state that for every $100 loaned by that type of bank $925 was put out by savings-and-loan associations. This example brings out the point that the purpose of such ratios is the comparison of one item to another as a standard of measure; therefore either item may be used as the base according to the emphasis desired. Whether the part-to-part relation has greater significance than part-to-total will also depend upon the emphasis needed in each case. This becomes especially important when two or more sets of such ratios are being compared, usually at different periods of time.

Percentage Relation: Since part-to-part ratios as well as part-to-total ratios are usually expressed in terms of per cents, precise terms must be used in expressing either kind of ratio in order to avoid ambiguity or misstatement. Furthermore in stating a part-to-part relationship, one item is no more a "per cent of" the other than is the case with ratios between unlike items. The sales of chain grocery stores and of independent grocery stores in a community for a given year might appear as follows:

Chain-store sales          $250,000
Independent-store sales     200,000

If a statement were made, "The independent-store sales were 80 per cent," this could be taken to mean 80 per cent of the two combined. "The independent-store sales were 80 per cent of the chain-store sales" would imply that independents were a part of the chains instead of an entirely different type of grocery. "The relation between the two is 80 per cent" fails to indicate which one is used as the standard.
The following are some of the correct statements that can be made: "Independent-store sales were 80 per cent as great as chain-store sales;" "Sales in the chain stores amounted to 125 per cent as much as the independent-store sales."

Percentage Difference: The relation between two parts is very frequently expressed as a percentage difference and may be computed by either of two methods: (1) by subtracting 100 per cent from the percentage relation, computed on either item as base or (2) by subtracting the item selected as base from the numerator and dividing the remainder by the base item. Due regard must be taken throughout for algebraic signs according to either method. Using the same example of grocery store sales (in thousands of dollars):

80 per cent - 100 per cent = -20 per cent

or

200 - 250 = -50 and -50 / 250 = -.20 or -20 per cent.

Again the wording must be precise and the base must be clearly indicated. "The difference between chain-store sales and independent sales was 20 per cent" does not tell which type of store has been used as the base or which had the greater sales. "Sales in independent stores were 20 per cent less than in chain stores" is a much clearer statement; or, if independent stores are selected as the base, "Sales in chain stores were 25 per cent greater than in independents," or "exceeded sales in independents by 25 per cent." Note that whenever the base is changed the percentage difference will change in amount as well as in direction.

Precision of statement is particularly necessary when the part-to-part or total-to-total ratios are time relationships. Differences between two items that are identical except in time are best expressed as per cents of positive or negative change, or per cents of increase or decrease, the methods of computation being the same as for deriving percentage difference in space and attribute ratios.
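The two methods of computing a percentage difference, and the effect of changing the base, can be checked with a short Python sketch. The function names are illustrative assumptions; the sales figures are those of the grocery-store example above.

```python
def percentage_relation(item, base):
    """Item expressed in per cent, taking the base item as 100 per cent."""
    return item * 100 / base


def percentage_difference(item, base):
    """Method (2): subtract the base from the item, then divide by the base."""
    return (item - base) * 100 / base


chain, independent = 250_000, 200_000

# Chain stores as the base.
print(percentage_relation(independent, chain))        # 80.0
print(percentage_relation(independent, chain) - 100)  # method (1): -20.0
print(percentage_difference(independent, chain))      # method (2): -20.0

# Independent stores as the base: the amount changes as well as the sign.
print(percentage_relation(chain, independent))        # 125.0
print(percentage_difference(chain, independent))      # 25.0
```

Keeping the base as an explicit second argument mirrors the point made in the text: a percentage difference has no meaning until the base is named.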
Table 35 provides examples of a number of time ratios in each of which an item in October, 1937, is compared with an identically defined item in October, 1936. All of these examples are total-to-total comparisons rather than part-to-part, since the two months compared are corresponding parts from different years instead of parts of the same year. The first four indicators are non-overlapping series but the fifth represents overlapping data. Despite this difference the same kind of wording can be used in reading all of the ratios in column 4: "In October, 1937, the production of steel ingots showed a 25.2 per cent decrease in comparison with the same month of the previous year"; "The number of cotton spindles active in October, 1937, showed but little change since October a year ago, an increase of only .3 per cent."

TABLE 35
INDICATORS OF BUSINESS ACTIVITY, OCTOBER, 1937 AND OCTOBER, 1936*

                                                  AMOUNT OR VALUE          (3)          (4)
                                                  (1)          (2)      PERCENTAGE   PERCENTAGE
BUSINESS INDICATOR                             Oct., 1936   Oct., 1937   RELATION    OF CHANGE

Steel ingot production (thous. tons)               4,534        3,393       74.8       - 25.2
Domestic auto. sales (Gen. Mot.) (number)         44,274      107,216      242.2       +142.2
Bituminous coal production (thous. tons)          43,321       40,040       92.4       -  7.6
Building contracts (mill. dollars)                   226          202       89.4       - 10.6
Cotton spindles active (thous. spindles)          23,662       23,724      100.3       +   .3

Column (3) = (2) ÷ (1) × 100. Column (4) = [(2) - (1)] ÷ (1) × 100, or (3) - 100%.
* The Annalist, Vol. 50, No. 1295 (November 12, 1937), pp. 796-97 and Vol. 50, No. 1296 (November 19, 1937), pp. 836-37. New York Times Co., New York.

If in Table 35 the comparisons had been with the previous month the first four might be regarded as part-to-part ratios but the wording would be no different except that September, 1937, would be named instead of October of the preceding year. Frequently the period used as a standard in constructing such ratios is not clearly indicated.
A newspaper headline may read, "Department-Store Sales Jump Seven Per Cent in August," but careful reading of the article discloses that the 7 per cent gain was not since July of the same year, as one might assume, but since August of the preceding year. In this connection it should be noted that in comparing two time ratios, both of which are based on the same previous standard as 100 per cent, these ratios or "index numbers" are handled exactly as if they were primary data. That is, the relation or difference is found by dividing one by the other, not by subtracting one index from the other. In the example mentioned the index of department-store sales rose from 83 in August, 1938, to 89 in August, 1939, an increase of 6 points on a base of 83 which is at the rate of 7 per 100 or 7 per cent.

Distinguishing between Percentage Relation and Percentage Difference: There is seldom any difficulty in distinguishing between percentage relation and percentage difference in either space, attribute, or time ratios provided the difference is less than 100 per cent. When the difference is large, it is easy to forget that one item must be subtracted from the other to obtain the percentage difference. Using the previous example of grocery-store sales, suppose that two years earlier the sales of the chain stores were only $25,000. Then the sales for the later year divided by the sales for the earlier year equal +10 or +1000 per cent. The sales in the later year therefore were 1000 per cent as great as the sales in the earlier year. To obtain the difference, 100 per cent must be subtracted, leaving an increase of 900 per cent. Or, find the difference between the two years, that is, the later year minus the earlier or base year, and divide by the earlier year:

    $250,000 - $25,000 = +$225,000; +$225,000 ÷ $25,000 = +9 or 900 per cent increase.

Further illustrations can be found by comparing column 3 with column 4 in Table 35.
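The distinction between relation and difference, and between index points and per cent, can be verified by recomputing columns 3 and 4 of Table 35 from the amounts. The sketch below is illustrative only; the function names and the dictionary layout are assumptions, while the figures are those quoted in Table 35 and in the examples above.

```python
def percentage_relation(later, earlier):
    """Later value as a per cent of the earlier (base) value."""
    return later * 100 / earlier


def percentage_change(later, earlier):
    """Signed per cent of increase (+) or decrease (-) on the earlier base."""
    return (later - earlier) * 100 / earlier


# Amounts from Table 35 as (October 1936, October 1937).
table_35 = {
    "Steel ingot production (thous. tons)": (4_534, 3_393),
    "Domestic auto. sales (Gen. Mot.)": (44_274, 107_216),
    "Bituminous coal production (thous. tons)": (43_321, 40_040),
    "Building contracts (mill. dollars)": (226, 202),
    "Cotton spindles active (thous. spindles)": (23_662, 23_724),
}

for indicator, (oct_1936, oct_1937) in table_35.items():
    relation = percentage_relation(oct_1937, oct_1936)  # column 3
    change = percentage_change(oct_1937, oct_1936)      # column 4
    print(f"{indicator:42s} {relation:7.1f} {change:+8.1f}")

# Two index numbers on the same base are treated like primary data:
# the gain is 6 points, but about 7 per cent of the earlier index.
index_1938, index_1939 = 83, 89
points_gained = index_1939 - index_1938
per_cent_gained = percentage_change(index_1939, index_1938)
print(points_gained, round(per_cent_gained, 1))

# A large difference: chain-store sales of $25,000 rising to $250,000.
print(percentage_relation(250_000, 25_000))  # 1000 per cent as great
print(percentage_change(250_000, 25_000))    # 900 per cent increase
```

In every row column 4 equals column 3 less 100, so an item that is 242.2 per cent as great as its base has increased 142.2 per cent, not 242.2 per cent.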
As has already been indicated the base item in time ratios is practically always the earlier period. Failure to observe this rule leads to still further confusion in the expression of percentage relation or per cent of increase or decrease, as illustrated in the following quotation:

    Making Hilarity Pay. The large majority of the bootleggers have now cut their prices from 200 to 300 per cent in a desperate effort to meet the competition of the State Liquor Stores. (Newspaper clipping.)

The reader would assume from the word "cut" that an earlier period had been used as the base. Whatever the former price may have been, a cut of 100 per cent would reduce it to zero, hence any greater decline would mean that the bootleggers were paying the purchasers to take their wares. A decline or decrease can never exceed 100 per cent. Very likely what happened was that liquor formerly selling at $3.00 per quart was reduced to $1.00 or $.75. The difference of 200 or 300 per cent was found by using the later period as the base of the ratio. Assuming that the present price is $.75, the method should have been as follows: $.75 ÷ $3.00 = .25 or 25 per cent. Subtracting 100 per cent leaves -75 per cent. Thus the present price is 1/4 the former price or has decreased 75 per cent.

There were two errors in the quoted statement: (1) the later instead of the earlier year was used as the base in computing percentage change and (2) the difference was incorrectly interpreted as percentage decrease instead of the per cent by which the past exceeded the present. There are a few occasions on which a later period may be used as the base, as when we say that the output of a plant was 10 per cent higher last year than this, or pre-war prices were 20 per cent below the current level, but examples of this kind occur so infrequently that they probably would better be disregarded entirely because they tend to confuse the unwary.
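The source of the "200 to 300 per cent cut" can be reproduced by computing the change on each base in turn. This is a small sketch with the prices assumed in the discussion above ($3.00 formerly, $1.00 or $.75 now); the function name is illustrative.

```python
def percentage_change(value, base):
    """Signed per cent by which value differs from base."""
    return (value - base) * 100 / base


former_price = 3.00

for present_price in (1.00, 0.75):
    # Correct: the earlier period is the base, so the result is a decrease.
    on_earlier_base = percentage_change(present_price, former_price)
    # Misleading: the later period is the base; this measures how far
    # the past price exceeded the present one, not the size of the cut.
    on_later_base = percentage_change(former_price, present_price)
    print(f"${present_price:.2f}: {on_earlier_base:+.1f} per cent on the earlier base, "
          f"{on_later_base:+.1f} per cent on the later base")
```

With the earlier price as base the decreases are about 67 and 75 per cent; only with the later price as base do the figures of 200 and 300 per cent appear. A decrease computed on the earlier base cannot fall below -100 per cent so long as prices are not negative.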
PRESENTATION OF RATIOS

Considerable attention has been devoted to proper presentation, both in text and tabular form, during the discussion of the construction of various kinds of ratios. These rules need only be reviewed briefly together with some reference to chapter VIII and with certain additions regarding ratio presentation in general.

In Text

The following points should be observed in any textual reference to ratios:

1. The exact scope of both numerator and denominator should be fully defined unless very clearly understood.

2. The expression of each ratio should be precisely and accurately worded according to the kind of relationship involved, leaving no possibility of misunderstanding.

3. If a ratio that does not actually appear in an accompanying table is used in the text, the data from which the ratio is derived should be quoted along with it.

In Tabular Form

The following rules will be a guide in tabular presentation:

1. The rule of definite and adequate headings in presenting primary data in tables also applies to ratios. If the original data are not included in the table the numerator and denominator items as well as the direction of relationship between them must be clearly defined in the table. If the data are listed in parallel columns, they may be referred to by column number, as in Table 35.

2. A separate derivative table should be made for each type of ratio comparison drawn from a given set of primary data. In particular, percentage distributions according to both horizontal and vertical classifications or according to more than one category should not be presented in a single table.

3. Every percentage distribution should include a 100 per cent total and the separate per cents must add to 100 except for having been rounded off according to the rule for significant figures. Carrying per cents to too many decimal places gives a false impression of accuracy.

4. Percentages of difference or change must be clearly indicated as positive or negative.

5. Whenever possible, the data from which ratios have been derived should be shown along with the ratios.

Importance of Including Original Data

The last-named point is of importance in presenting any type of ratio. In a complete presentation of any subject the original data will, of course, appear in primary tables and need not necessarily be repeated in every derivative table. In a small summary table, however, there is little danger of too great complication if the data and ratios are arranged in parallel columns. For example, in Table 32 the meaning of the percentages would have been much more evident had two additional columns been given as follows: "Number of Families in State" and "Number of Families Living on Farms." It should be remembered that the reader is rightly skeptical in accepting any statement of relationships that he cannot verify by making the computation himself.

Table 36 contains a number of errors, but because the original data are given along with the ratios, it is possible for the reader to detect the errors and to correct them as well as to add his own interpretation. Two of the errors in Table 36 are typographical, such as may often be found as a result of lack of careful proofreading. One of these must be discovered in order to avoid misinterpreting the percentage distributions (note also the incorrect caption of this column), whereas the other is less serious.

TABLE 36
AMOUNT OF PUBLIC DEBT DUE BEFORE AND AFTER JAN. 1, 1939, EXCLUDING PRE-WAR, POSTAL SAVINGS AND UNITED STATES SAVINGS BONDS AND SECURITIES ISSUED EXCLUSIVELY TO GOVERNMENT AGENCIES AND TRUST FUNDS

                                                 AMOUNT
DIVISION                                      (IN MILLIONS)   P.C. OF TOT.

JUNE 30, 1932
  Due before Jan. 1, 1939                       $10,870.7         60.2
  First Liberty bonds (1947), called 1935         1,933.2         10.7
  Due after Jan. 1, 1939                          5,258.8         29.1
  Total                                         $18,062.7        100.0

JUNE 30, 1933
  Due before Jan. 1, 1939                       $13,458.4*        53.3†
  First Liberty bonds (1947), called 1935         1,933.2          9.2
  Due after Jan. 1, 1939                          5,215.9         24.8
  Total                                         $21,028.4        100.0

JUNE 30, 1934
  Due before Jan. 1, 1939                        $3,458.4‡        53.3†
  First Liberty bonds (1947), called 1935         1,933.2          7.7
  Due after Jan. 1, 1939                          9,861.2         39.0
  Total                                         $25,252.8        100.0

JUNE 30, 1935
  Due before Jan. 1, 1939                       $10,000.8         38.3
  First Liberty bonds (1947), called 1935
  Due after Jan. 1, 1939                         16,093.9         61.7
  Total                                         $26,094.7        100.0

* Should read $13,879.3.
† Should read 66.0.
‡ Should read $13,458.4.

The first line of the distribution for June 30, 1933, as printed is really the first line of the distribution for June 30, 1934. As indicated in the footnote, it should read: Due before January 1, 1939, $13,879.3; 66.0. The error can be discovered by noticing that the percentage distribution for 1933 as printed adds to 87.3 per cent instead of 100 per cent. The second error occurs in the first line of the distribution of June 30, 1934, in which the amount is printed as $3,458.4 instead of $13,458.4. This error becomes obvious after the first one has been detected. If neither error were discovered, one would naturally read the table to mean that ten billion dollars of bonds had been retired or replaced by longer maturities during the fiscal year 1933-34. The corrected figures show that the reduction was only 421 million dollars.

In this case there are more errors in the original data than in the per cents but they provide a check on each other. When ratios alone appear it is impossible to determine where the error lies. Except in the case of distributions totaling 100 per cent, the existence of error would not be apparent unless the figure obviously disagreed with known conditions.

Referring again to Table 36, if the column giving the amount of the total debt were omitted, it would be very difficult by reading the per cents alone to comprehend the changes that have taken place.
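The cross-check that exposes the errors in Table 36 (amounts against their total, per cents against 100, and each per cent against its own amount) can be sketched as follows. The function name and the tolerance of 0.15 are assumptions chosen for illustration; the figures are those printed in Table 36 and its footnotes.

```python
def check_distribution(rows, total, tolerance=0.15):
    """Return a list of discrepancies found in one percentage distribution.

    rows is a list of (amount, printed_per_cent) pairs; total is the
    printed total amount. The tolerance allows for rounding to one decimal.
    """
    problems = []
    amount_sum = sum(amount for amount, _ in rows)
    per_cent_sum = sum(per_cent for _, per_cent in rows)
    if abs(amount_sum - total) > tolerance:
        problems.append(f"amounts add to {amount_sum:,.1f}, not {total:,.1f}")
    if abs(per_cent_sum - 100) > tolerance:
        problems.append(f"per cents add to {per_cent_sum:.1f}, not 100")
    for amount, per_cent in rows:
        recomputed = amount * 100 / total
        if abs(recomputed - per_cent) > tolerance:
            problems.append(
                f"{amount:,.1f} is {recomputed:.1f} per cent of the total, "
                f"but {per_cent} is printed"
            )
    return problems


# June 30, 1933 and June 30, 1934 as printed in Table 36.
printed_1933 = [(13_458.4, 53.3), (1_933.2, 9.2), (5_215.9, 24.8)]
printed_1934 = [(3_458.4, 53.3), (1_933.2, 7.7), (9_861.2, 39.0)]

# The same distributions with the footnote corrections applied.
corrected_1933 = [(13_879.3, 66.0), (1_933.2, 9.2), (5_215.9, 24.8)]
corrected_1934 = [(13_458.4, 53.3), (1_933.2, 7.7), (9_861.2, 39.0)]

for label, rows, total in [
    ("1933 as printed", printed_1933, 21_028.4),
    ("1934 as printed", printed_1934, 25_252.8),
    ("1933 corrected", corrected_1933, 21_028.4),
    ("1934 corrected", corrected_1934, 25_252.8),
]:
    print(label, check_distribution(rows, total) or "consistent")

# Reduction in debt due before Jan. 1, 1939, using corrected amounts.
print(round(13_879.3 - 13_458.4, 1))  # 420.9 million dollars
```

Note that the 1934 per cents as printed add to 100, so the misprinted amount is caught only because the amounts are shown beside them; with the per cents alone it would pass unnoticed.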
For example, the decline in the per cent of the debt due before 1939 from 66 per cent in 1933 to 53 per cent in 1934 might be ascribed to refunding operations during the year. The amounts show that the decline in per cent of early maturity was caused mainly by an increase in the total debt resulting from the addition of some four and one-half billion of longer-term bonds maturing after 1939. In contrast to this the further decline from 53 per cent in 1934 to 38 per cent in 1935 of bonds maturing before 1939 was the joint result of a decline of about three and one-half billions in short-term maturities and an increase of over six billions in longer-term maturities. Thus we see how essential the amounts are in arriving at the proper interpretation of the changes in the per cents.

Sometimes additional relationships can be derived from a given set of data. If the original data are not shown the reader is prevented from working out ratios which may be of more interest to him than those selected by the author. Full presentation of the original data is therefore evidence of good faith on the part of the author. The reader is free to check every statement and to work out his own interpretation.

COMPARISONS BETWEEN RATIOS

Large and unwieldy figures are reduced to ratio form chiefly because comparisons between two or more such ratios can be easily interpreted. Many relationships that are entirely obscured in the original data can be brought out through the correct use of comparisons between ratios. In fact a comparison is so implicit whenever two or more related ratios are presented together that up to this point in the chapter it has been impossible to confine the discussion to single ratios and not to anticipate to some extent the relations that exist between the several ratios.
Kinds of Comparisons

These comparisons between ratios group themselves into two distinct types: (1) those between several ratios in a single series, all of which have the same base, and (2) those between two or more separate ratios, in each of which the base is a different quantity or value. The first kind of comparison, with a very few exceptions, involves ratios between like items while in the second the ratios compared may be made up of like or unlike items. The methods used in the two kinds of comparison will be discussed separately.

Ratios on the Same Base. There are two kinds of series in which several ratios are computed on the same base: (a) percentage distributions of parts of a total and (b) index numbers in which successive values in a time series are expressed as per cents of an earlier year or some other normal base period. The primary purpose in presenting either of these types of ratios in a series is to show the importance of each individual item in relation to the base of 100 per cent. Again it should be emphasized that the construction of either type of series presupposes the homogeneity of the data. In a percentage distribution the separate parts must comprise a unified whole and in a time series the successive values must be identically defined from period to period.

Percentage distributions: As was illustrated in Table 31, the expression of each part as a per cent of the total makes it easy to estimate the relative importance of each part. Stating the several per cents, along with the 100 per cent base, usually expresses the comparison sufficiently without any further computation. The difference between any two such ratios may also be expressed by subtracting one from the other provided the relation of this difference to the 100 per cent base is clearly indicated.
In Table 31, for example, "Only 5 per cent more of the total mortgages was held by savings-and-loan associations than by the group next in importance, banks and trust companies." If a direct relation between any two items had been of main importance without reference to the total, the percentage distribution need not have been made. It would have been simpler to divide the two items of original data. However, if only a percentage distribution is available without accompanying data, dividing one per cent by the other will give the relation between the two items, since the identical denominators cancel out. To express the importance of savings-and-loan associations relative to banks and trust companies: the amount for savings-and-loan associations divided by the total of 1,437.3, divided in turn by the amount for banks and trust companies over the same total of 1,437.3, is equivalent to the first amount divided by the second; therefore dividing the two quotients, 30.1 ÷ 25.0, gives the same result, 120 per cent. This operation, strictly speaking, should not be considered as a comparison of ratios, but merely as a substitution of the percentages for the original items in order to derive a simple ratio between them.

Index numbers of time series: The emphasis in this kind of series also is on the relation of each item to the base rather than on direct relations between any two of the several items. Since indexes are apt to be more readily available than the original data and comparisons may be wanted between two specific periods instead of comparing either one with the base period, there is frequent occasion for expressing comparisons between two individual index numbers. The procedure is exactly parallel to that already described for percentage distributions: the two values may be subtracted provided the difference is stated as a per cent of the base period; or, as has already been mentioned in the discussion of percentage increase or decrease, the second may be divided by the first, the same as would be done with the original items.

Ratios on Different Bases.
In the preceding section dealing with ratios on a common base the comparisons of such ratios were made in the same direction as the computation of the original ratios. That is, the numerator items differed from their base with respect to a single characteristic and the comparisons between the ratios were concerned with these same differences. When, for example, the characteristic was time in years, the various ratios computed on a base year were compared only with respect to their differences from this base year. No additional differences of any kind were introduced in the comparisons.

Comparisons between ratios that are on different bases involve more complex relationships since they are always cross-comparisons of the original ratios. The ratios compared may be made up of like items or of unlike items and in either case the comparisons will be concerned with differences in a characteristic that was not involved in the separate ratios.

Ratios of like items: Classification according to Characteristic: Since single ratios between like items are classified as time, space, or attribute, the comparisons between such ratios according to a second characteristic become a cross-classification of these three kinds of characteristics. Figure 33 presents this cross-classification with an example of each kind of comparison. Each of the three main groups of comparisons between ratios includes the three kinds of single ratios according to characteristic, with the exception of space comparisons of space ratios.

FIGURE 33
CLASSIFICATION OF COMPARISONS BETWEEN RATIOS OF LIKE ITEMS WITH EXAMPLES OF EACH KIND

(Arranged by kind of comparison between ratios, then by kind of simple ratio, with an example of each.)

Comparison between ratios: Time

1. Simple ratio: Time. Ratios of the amount of magazine advertising in December to the amount in November, compared for a period of years.

2. Simple ratio: Space. Ratios of the production of steel in Buffalo to the production in Cleveland, compared for each of 12 months of a given year.

3. Simple ratio: Attribute. Percentage distributions by economic classes of the total value of United States exports, compared annually for several years.

Comparison between ratios: Space

4. Simple ratio: Time. Ratios of the indexes of department-store sales for December of a given year to December of the preceding year, compared by Federal Reserve Districts.

5. Simple ratio: Space. (This combination is impossible because there cannot be cross-classification of spatial characteristics.)

6. Simple ratio: Attribute. Ratios of low-priced car sales to total passenger-car sales in a given year, compared by main geographic divisions of the United States.

Comparison between ratios: Attribute

7. Simple ratio: Time. The percentages of increase or decrease in value of United States exports over a period of years, the changes being compared by economic classes.

8. Simple ratio: Space. The percentage of sales in a given sales district to total sales by a wholesale hardware company during a given year, compared by types of product distributed by the firm.

9. Simple ratio: Attribute. Ratios of cash to installment sales in a given month, compared by departments, in a large department store.

If the data for any of the examples in Figure 33 were set up in tabular form, it would be clear that since each ratio has a different base, there are no constant terms in the ratios that are being compared. In each case, however, there are certain points in common between the ratios which allow valid comparisons to be made between them.

Tests of Comparability: Most important of these is the fact that the ratios are "like" in the same sense that original items are "like." That is, they are identically defined except for one characteristic. The numerators of the ratios being compared are like and their denominators are like, each set differing according to an identical classification which becomes the classification of the comparisons between the several ratios.
A table showing the ratios used in Example 1, Figure 33, and including the original data, demonstrates the method by which the "likeness" of numerators and of denominators in any case may be determined and the consequent possibility of drawing comparisons between the respective ratios. In Table 37, the original ratios are each between two months in a given year, December to November, and these ratios are compared for several years. The headings and the title, as explained by notes in the original source, indicate that in each month and from year to year throughout the period the magazine lineage is measured by an identical method. For each month the data represent from 80 to 85 per cent of all magazines in the United States, the reports being compiled regularly by Printers' Ink. The numerators in column 2 are therefore identically defined, and likewise the denominators in column 1. Since the several numerators differ from one another only according to the stub classification, time in years, and the corresponding denominators follow the same classification, the resulting ratios in column 3 differ only according to this same characteristic. Hence the ratios are "like" and it is justifiable to draw comparisons between them: for example, to observe that for every year during this period, except 1935, the lineage has been smaller in December than in November, the declines ranging from .2 per cent to 26.3 per cent.

TABLE 37
CHANGES IN MAGAZINE ADVERTISING LINEAGE, NOVEMBER TO DECEMBER, 1933-38*
(thousands of lines)

                                            (3)
             (1)           (2)        PER CENT DIFFERENCE
YEAR       NOVEMBER      DECEMBER      (2) ÷ (1) - 100%

1933         1,899         1,791           -  5.7
1934         2,317         2,136           -  7.8
1935         2,201         2,334           +  6.0
1936         2,736         2,731           -   .2
1937         2,989         2,893           -  3.2
1938         2,251         1,658           - 26.3

* United States Department of Commerce, Survey of Current Business, 1936 Supplement, p. 24; 1938 Supplement, p. 25; September, 1939, p. 23.

If there has been any change in definition the data cannot be given under a single column heading without an explanation of the change, in a footnote, which will indicate that the data are not comparable. In Example 2, Figure 33, if the figures for steel production included only the city limits for the first few months and the metropolitan district for the remaining months, this fact would be noted and the resulting ratios could not be compared throughout the year.

The following example illustrates how an invalid comparison may be made due to a concealed change in definition which could have been discovered if the data had been tested by a tabular analysis similar to Table 37. This statement appeared in the editorial columns of a city newspaper:

    CAMPUS-SHY CUPID

    X.Y.Z. University is no matrimonial bureau, say alumni officials of that institution. A recent survey [1937] of the classes from 1928 to 1935, inclusive, shows that fewer than half of the coeds graduated from X.Y.Z. in those years have married. Of the alumnae who answered the questionnaire, the following percentages reported they had married: 1928, 54.3 per cent; 1929, 46.8; 1930, 42.5; 1931, 41.4; 1932, 34.7; 1933, 30.2; 1934, 20.3; 1935, 12.7.

In tabular form, the data might have been somewhat as follows:

                                   GRADUATES MARRIED BY 1937
YEAR OF          (1) NUMBER        (2)          (3) PERCENTAGE
GRADUATION       GRADUATING       NUMBER          OF TOTAL

1928                 300            163             54.3
1929                 350            164             46.9
1930                 400            170             42.5
1931                 500            207             41.4
etc.

A study of the stub and column headings shows that since all of the reports were made in the same year, 1937, there is no time comparison between the ratios. There was a time difference in the year of graduation, and this classification in the stub gives the appearance of a time comparison between the several rows. However, the heading of columns 2 and 3, "Married by 1937," indicates that in order to get the true definition of the terms of the ratios, the date of each class must be subtracted from this fixed date. The result becomes a classification by attribute: the number of years since graduation, or the successively shorter periods during which each class has been exposed to the "hazard" of marriage. The ratios do show that for each additional year since graduation a larger percentage of the members of any class will have married, but this fact scarcely requires proof. If a time comparison between ratios is desired, all the numerator items must be like in attribute, but the time of each ratio must be different. That is, each class must have been out of college for an equal length of time, the comparisons being made in successive years. If the question had been "Married within five years after graduation," then the resulting ratios would have been made in 1933, 1934, etc. Such data would probably give no clear evidence of either increase or decrease in the percentage of college women married.

Another example will show that any violation of the previously discussed principles governing the construction of individual ratios will destroy the significance of comparisons between them. One of these principles stated that the possibility of further subdivision of data must be kept in mind before combining in ratio form two general totals that appear to be "like" in definition. The invalidity of such a ratio may not be apparent until it is compared with one or more similarly constructed ratios, with results that are contrary to known facts. A research bureau collected from 30 manufacturing and wholesale concerns monthly data on the value of outstanding accounts and overdue accounts. From the totals of these 30 reports the ratio of overdue to outstanding accounts was computed as of the first of each month. However, when the July, 1937, ratio showed a noticeable increase over June, several of the concerns complained that the true situation was being misrepresented.
Table 38, giving the collected data and accompanying ratios, shows what happened as a result of combining diverse elements in a single total. The ratio of overdue to outstanding accounts for all 30 firms increased from 20.4 per cent to 22.7 per cent between June and July, 1937, due to a decrease in the denominator. This decrease in the total outstanding accounts can be charged entirely to the 6 food concerns. Their outstanding accounts showed a drop so great that it more than counteracted the slight increase shown by the other 24 concerns. The ratios of the food concerns were quite different from the other 24 in both months. The 6 food concerns were subsequently reported separately, thus eliminating dissatisfaction.

TABLE 38
OVERDUE AND OUTSTANDING ACCOUNTS OF THIRTY CONCERNS FOR JUNE AND JULY, 1937, AND THE RATIO OF OVERDUE TO OUTSTANDING ACCOUNTS

TYPE OF                       OUTSTANDING      OVERDUE       OVERDUE ÷
CONCERN              MONTH     ACCOUNTS        ACCOUNTS    OUTSTANDING (%)

6 food               June      $173,901       $ 61,780          35.5
                     July       133,712         70,904          53.0
24 manufacturing     June       822,516        141,307          17.2
                     July       836,410        149,212          17.8
30 combined          June       996,417        203,087          20.4
                     July       970,122        220,116          22.7

In this case, the numerator of each ratio consisted of 30 parts, each of which had a very definite relation to a corresponding part of the denominator. However, the 6 food concerns were so different from the other 24 that when all were combined the result was a heterogeneous total that did not correctly represent either of the component groups. Consequently, neither the individual ratios between these totals, nor the comparisons between the ratios, could be given any definite interpretation. The only way to analyze such a situation is to assemble the data, part by part, and to study each individual relationship in order to discover which totals or subtotals should be used in deriving ratio comparisons.
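The effect of combining the two groups can be traced by recomputing each ratio from the dollar amounts in Table 38. The sketch below is illustrative; the dictionary layout and function names are assumptions, and all per cents are recomputed from the amounts rather than copied.

```python
# (outstanding, overdue) in dollars, from Table 38.
accounts = {
    "6 food": {"June": (173_901, 61_780), "July": (133_712, 70_904)},
    "24 manufacturing": {"June": (822_516, 141_307), "July": (836_410, 149_212)},
}


def overdue_ratio(outstanding, overdue):
    """Overdue accounts as a per cent of outstanding accounts."""
    return overdue * 100 / outstanding


def combined(month):
    """Sum outstanding and overdue accounts over both groups for a month."""
    outstanding = sum(group[month][0] for group in accounts.values())
    overdue = sum(group[month][1] for group in accounts.values())
    return outstanding, overdue


for month in ("June", "July"):
    for name, group in accounts.items():
        print(f"{name:18s} {month}: {overdue_ratio(*group[month]):5.1f}")
    print(f"{'30 combined':18s} {month}: {overdue_ratio(*combined(month)):5.1f}")

# The fall in the combined denominator comes from the food group alone.
food_drop = accounts["6 food"]["July"][0] - accounts["6 food"]["June"][0]
other_rise = (accounts["24 manufacturing"]["July"][0]
              - accounts["24 manufacturing"]["June"][0])
print(food_drop, other_rise, food_drop + other_rise)
```

The 24 manufacturing concerns move only from about 17.2 to 17.8 per cent, while the combined ratio rises from about 20.4 to 22.7 per cent, almost entirely because the food group's outstanding accounts fell by some $40,000 while its overdue accounts rose.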
Interpretation According to Kind of Relationship: The eight examples in Figure 33 afford illustrations of all kinds of ratios according to relationship: part to part, total to total, and part to total. Numbers 1 and 4 are comparisons between part-to-part time ratios, the first being ratios between two corresponding parts of the same year compared for several different years, and the fourth being ratios between corresponding parts of two different years compared in different areas. In Number 9 the ratios are between two component parts of total sales, compared according to a second attribute, the different departments of the store. The ratios in Number 2, compared for several successive periods, are between two cities in the United States and may therefore be regarded either as part-to-part or total-to-total relationships.

Numbers 6 and 8 are examples of comparisons between a set of part-to-total ratios and Table 38 illustrates the same kind of comparison. In this type of comparison there are as many separate denominators as there are ratios being compared, all identically defined but not identical in amount. Reduction to ratio form has in one sense placed all the numerators on a common base of 100 per cent, but it must be remembered that the numerators are now relative instead of absolute amounts and hence are not subject to further computation with the same freedom as were the original data. Such cases need to be carefully distinguished from comparisons between the several parts within a single percentage distribution. In the discussion of ratios on the same base it was demonstrated that such ratios could be substituted for the original data in making computations. This is no longer true when the bases are different, since they do not cancel out when one ratio is divided by the other.
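The cancelling of a common base, and its failure when the bases differ, can be illustrated with the June, 1937, figures from Table 38. This is only a sketch; the variable names are assumptions, and the choice of Table 38 as the illustration is made here for convenience rather than taken from the text.

```python
# June 1937 amounts from Table 38, in dollars.
food_outstanding, food_overdue = 173_901, 61_780
mfg_outstanding, mfg_overdue = 822_516, 141_307
total_outstanding = food_outstanding + mfg_outstanding

# Same base: each group's share of the one combined total.
food_share = food_outstanding * 100 / total_outstanding
mfg_share = mfg_outstanding * 100 / total_outstanding
ratio_of_shares = food_share / mfg_share
ratio_of_amounts = food_outstanding / mfg_outstanding
# The common denominator cancels, so the two quotients agree.
print(round(ratio_of_shares, 4), round(ratio_of_amounts, 4))

# Different bases: each group's overdue ratio on its own outstanding total.
food_rate = food_overdue * 100 / food_outstanding
mfg_rate = mfg_overdue * 100 / mfg_outstanding
ratio_of_rates = food_rate / mfg_rate
ratio_of_overdue_amounts = food_overdue / mfg_overdue
# The denominators differ and do not cancel; the quotients disagree.
print(round(ratio_of_rates, 2), round(ratio_of_overdue_amounts, 2))
```

The food concerns' overdue ratio is about twice that of the other concerns, yet their overdue amount is less than half as large; dividing one ratio by the other therefore says nothing about the relative size of the original items.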
Comparisons between several series of ratios, each series having a single base, present the same situation as sets of single part-to-total ratios except that they are more complex and afford greater opportunities for misinterpretation. Numbers 3 and 7 from Figure 33 are examples respectively of comparisons between several percentage distributions and between several time series, both examples having been derived from the same set of original data.

Part A of Table 39 shows original data on United States exports cross-classified according to four years in time, and according to five subdivisions of the attribute, economic class. In Part B, several percentage distributions according to economic class are given, one distribution for each of the four years. These may be interpreted together, somewhat as follows: Finished manufactures comprise the largest share of United States exports and have been increasing in relative importance throughout the period, from 41.8 per cent in 1934 to 49.0 per cent in 1937. Crude materials, the group second in importance, and manufactured foodstuffs have each meanwhile formed a continually diminishing share of the total exports. Semi-manufactures and crude foodstuffs have maintained a fairly constant relative importance, except for the increase in semi-manufactures in 1937.

If a table such as this one were read too hastily, the horizontal rows of per cents might easily be mistaken for index numbers on some earlier base. With this misinterpretation the first row would indicate that exports of crude materials had decreased in value each succeeding year, which is of course not the case. The actual indexes, which appear in Part C, and the per cents of increase or decrease in Part D, present an entirely different situation. The total value of United States exports has increased every year since 1934, the total in 1937 being 57 per cent greater than in 1934.
A net increase for the four-year period has been reflected in every class of exports. The greatest and most consistent increases in proportion to 1934 appeared in manufactures, semi-manufactures showing slightly greater proportionate gain than finished manufactures. Manufactured foodstuffs declined in 1935 and 1936, but in 1937 exceeded the base year value by 6 per cent. Crude foodstuffs showed no change in 1935, a slight slump in 1936, but in 1937 jumped to 178 per cent of the 1934 value. Crude materials rose to 105 in 1935, dropped to 102 the following year, and in 1937 amounted to 111 per cent of the base year value.

[Table 39. United States exports by economic class, 1934-37: Part A, original data; Part B, percentage distributions by economic class; Part C, indexes; Part D, per cents of increase or decrease.]

The interpretation of any set of parallel time series is full of hazards. Chief among these is a tendency to make cross-comparisons between the corresponding per cents in the various series as if they were absolute values instead of indicating that every change is in proportion to the base of its own series. Failure to read each change in relation to the base year may give the mistaken impression that the percentage of change refers to the year immediately preceding.

These two examples of comparisons between series of ratios have brought out two entirely different sets of relationships neither of which was very obvious from the original data. In neither case was it necessary to make any further computations in order to express the comparisons.
When they are thus stated in general terms, or are implicit, these relationships are usually clear and can be readily understood because the corresponding ratios in the several series are all expressed as relatives of 100 per cent.

Ratios of unlike items: There is no difference from the foregoing method in dealing with comparisons between ratios made up of unlike items. They correspond exactly to total-to-total attribute ratios between like items and may be compared according to the same characteristics, time, space, or attribute. Figure 34 contains an example of each of the three classifications.

Since any single ratio between unlike items must have a numerator and denominator that are identically defined in time and space, the most common comparisons between several such ratios will be with respect to differences in either one of these two characteristics. In such cases the numerators of all the ratios will be classified according to either time or space, their respective denominators will have the same classification, and this will become the classification of the ratio comparisons, as in Examples 1 and 2 of Figure 34.

FIGURE 34
CLASSIFICATION OF COMPARISONS BETWEEN RATIOS OF UNLIKE ITEMS, WITH EXAMPLES OF EACH

| KIND OF COMPARISON BETWEEN RATIOS | EXAMPLE |
|---|---|
| 1. Time | Amount of income tax paid per capita in the United States, compared annually over a period of years. |
| 2. Space | Ratios of turnover per store for a retail clothing chain during a given year, compared for 10 different cities. |
| 3. Attribute | Ratios of new orders received by a manufacturing concern to its shipments of finished goods during a given month, compared by type of goods. |

Ratios between unlike items are also sometimes compared with respect to an attribute, but since unlike items do not ordinarily possess a common set of attributes this is a situation less frequently found.
In accounting practice, however, the unlike items may often be overlapping parts of two different attribute classifications of the same thing. These attributes have a relation to one another although they are not mutually exclusive. However, since both terms of such ratios are expressed in the same unit they are subject to a cross-classification according to some other attribute of the common unit. Number 3 is such an example: new orders and shipments are items that overlap, but both may be classified according to type of goods, and the ratios between them may be compared according to that attribute.

In other instances series of ratios between unlike items may occur in which all have the same denominator, the numerators being differentiated according to an attribute. Examples are: death rates in the United States for a given year according to specific causes of death; production of wheat per acre under identical conditions according to grade of seed sown; and the per capita consumption of beef, mutton, and pork in the United States. A number of such series, each on a single base, may be compared with respect to differences in any other characteristic common to both terms of the ratios. The deaths from various causes might be separated according to sex and the two sets of rates compared; the production of wheat from the different grades sown might be tested in several states to compare the effects of varying climatic conditions; and changing habits in per capita consumption of the three kinds of meat might be compared over a long period of years. The interpretation of comparisons between such series of ratios would be similar to the interpretation of series of ratios between like items that was illustrated in Table 39.

It will be recalled that ratios between unlike items are not expressed as per cents except in certain instances when both items have the same unit.
The column heading of a set of ratios names the values below it as a certain number of the numerator unit per 1, 1,000, 100,000, etc., of a given denominator, as "per capita consumption, in pounds," "average individual income tax, in dollars," or "number of automobile deaths per 10,000,000 gallons of gasoline consumed." Consequently, the ratios in the table appear in the same unit as some of the original data, and it is more difficult to remember that they are relatives than when they appear in the form of per cents. If too many different combinations of these ratio relationships between unlike items are given in a single table, the resulting confusion may be as great as when too many kinds of per cents are used together. There is also the same necessity for guarding against changes in definition of numerators or denominators during the course of the comparisons. The danger of comparing heterogeneous totals must likewise be avoided, since unlike items that are selected as having a relation to each other may not have been given limitations sufficiently specific. Some of these more complex problems involved in comparisons between ratios will be dealt with in the next chapter.

Averaging Ratios

Averaging ratios is one method of comparing them. However, the principles involved are sufficiently different from those discussed in preceding pages to warrant a separate presentation. There are two rules that must be observed: (1) ratios cannot be averaged unless they are comparable in every respect; and (2) whenever ratios are averaged they must be weighted according to their relative importance.⁴

Comparability. This principle carries over directly from the earlier discussion of comparability of the terms of individual ratios and comparability between ratios. The data must be homogeneous for the purpose at hand, and the numerators and denominators must retain the same definitions throughout.
In Table 40, column 3, the yield per acre of rye in the United States is given for three successive years, with the average yield for the three years combined. These data might well be considered as too general for some purposes, but for the purpose of comparing the United States with other countries they are specific enough. The arrangement of the original data in columns 1 and 2 indicates that the definitions have remained constant: no change has occurred in the units of measure, acres and bushels; the year in each case is the calendar year and not the crop year; and the total acreage and the total production have been secured from all the rye-producing states by methods of reporting that are essentially the same from year to year. Therefore the average yield for the three years, 11.4 bushels per acre, is valid from the point of view of comparability of the ratios from which it was computed.

⁴ This is intended to apply primarily to the use of the arithmetic average. A geometric average is often computed without weights.

Weighting. The relative importance of ratios is determined by the values of their respective denominators. If ratios are weighted by their denominator values and then averaged, the result is the same as a ratio between the sums of the original data of numerators and denominators. The latter method was used in Table 40:

$$\frac{95{,}131}{8{,}354} = 11.4$$

TABLE 40
PRODUCTION, ACREAGE, AND YIELD PER ACRE OF RYE IN THE UNITED STATES, 1933-35*

| YEAR | (1) PRODUCTION (1,000 bu.) | (2) ACRES HARVESTED (1,000 acres) | (3) YIELD PER ACRE (bu.) |
|---|---|---|---|
| 1933 | 21,150 | 2,349 | 9.0 |
| 1934 | 16,045 | 1,942 | 8.3 |
| 1935 | 57,936 | 4,063 | 14.3 |
| Total | 95,131 | 8,354 | 11.4 |

* Agricultural Statistics, 1936, p. 25, United States Department of Agriculture.

Whenever all the original data are available this is a much simpler procedure than weighting and averaging the individual ratios, as follows:

$$
\begin{aligned}
9.0 \times 2{,}349 &= 21{,}150 \\
8.3 \times 1{,}942 &= 16{,}045 \\
14.3 \times 4{,}063 &= 57{,}936 \\
\text{Total: } 8{,}354 &\qquad 95{,}131
\end{aligned}
\qquad\qquad
\frac{95{,}131}{8{,}354} = 11.4 = \text{weighted average}
$$

(The products are entered as the original production figures; computed from the rounded yields they differ slightly.)

Approximate relatives in round numbers may be used as weights instead of the actual denominator values, causing practically no difference in the average. The acreages for the three years are in the approximate proportion of 7, 6, and 12. Using these weights and dividing by the sum of the weights gives the following result:

$$
\begin{aligned}
9.0 \times 7 &= 63.0 \\
8.3 \times 6 &= 49.8 \\
14.3 \times 12 &= 171.6 \\
\text{Total: } 25 &\qquad 284.4
\end{aligned}
\qquad\qquad
\frac{284.4}{25} = 11.4 = \text{weighted average}
$$

It is therefore possible to average ratios for which no accompanying data are available provided that there is available instead a set of weights proportionate to the values of the actual denominators. If no information is at hand concerning the relative importance of the denominators, the ratios cannot be averaged.

The only case in which a simple average of ratios will give a correct result is when all the denominators are of equal importance, that is, when they are identical in value. However, this constitutes no exception to the rule that weighting is always necessary. Such an average is not unweighted, but all the ratios have been given equal weights.

Using some other set of weights in place of the denominators of the ratios will not give a valid average.⁵ Some other factors may appear to be of importance but the average will be distorted unless such factors are combined with the separate denominators before the original ratios are computed.

Table 41 shows in Part A an incorrect weighted average, and in Part B the same data averaged by the correct use of weights. The weights used in Part A are the numbers of workers at each wage rate.
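The rye computations from Table 40 can be checked with a short sketch. The figures are those printed in the table; the function and variable names are illustrative and not part of the original text. The simple mean is included only to show what is lost when the weights are ignored.

```python
# Figures from Table 40: rye in the United States, 1933-35.
production = [21_150, 16_045, 57_936]   # 1,000 bushels (numerators)
acres = [2_349, 1_942, 4_063]           # 1,000 acres harvested (denominators)
yields = [9.0, 8.3, 14.3]               # bushels per acre, as printed


def weighted_average(ratios, weights):
    """Average the ratios, each weighted by its denominator or a proportional weight."""
    return sum(r * w for r, w in zip(ratios, weights)) / sum(weights)


# Ratio between the sums of the original numerators and denominators.
ratio_of_totals = sum(production) / sum(acres)

# Printed yields weighted by actual acreage, then by the round weights 7, 6, 12.
by_acreage = weighted_average(yields, acres)
by_round_weights = weighted_average(yields, [7, 6, 12])

# Simple mean of the three yields, which treats the years as equally important.
simple_mean = sum(yields) / len(yields)

print(f"ratio of totals:        {ratio_of_totals:.1f}")
print(f"weighted by acreage:    {by_acreage:.1f}")
print(f"weighted by 7, 6, 12:   {by_round_weights:.1f}")
print(f"simple mean of yields:  {simple_mean:.1f}")
```

The three weighted forms agree at 11.4 bushels per acre to one decimal place, while the simple mean gives 10.5 because it understates the importance of the large 1935 acreage.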
Obviously these are of importance in determining average wage increases, but if so they must be incorporated in the denominators of the original ratios. Instead of being merely wage rate, the numerator and denominator of each ratio should be payroll, that is, rate × number of employees, as in Part B. Since the number of workers at each rate is assumed to be the same in 1936 as in 1926, the per cents of change in the payroll (Part B, column 6) are the same as the per cents of change in wage rates (Part A, column 3). The weights, however, are in different proportion since in Part B the numbers of workers have been multiplied by varying wage rates as of 1926. Reduced to a common base of 100 the weights in Part A would be 40, 32, and 28, but in Part B they would be 13, 51, and 36. Consequently the average of the weighted ratios in Part B differs from that in Part A. The average in Part B is the correct one because it coincides with the ratio of the total original data (total of column 5 ÷ total of column 4, minus 100 per cent). Similar proof cannot be applied to the method used in Part A.

⁵ An exception to this rule is explained in connection with Table 74, page 394.

[Table 41. Part A, incorrect average: percentage increases in wage rates, 1926 to 1936, weighted by number of employees. Part B, correct average: percentage increases in payroll, weighted by 1926 payroll.]
The total of column 2 divided by the total of column 1 minus 100 per cent equals +36 per cent whereas the weighted average is +29.8 per cent.

It might be argued that in an actual case the number of workers at each wage rate would not remain the same over a period of 10 years. In such cases the payrolls can be determined for each year and, if the ratios between them are weighted by the denominators as in Part B, the average will again agree with the ratio between the total original data.

The usefulness of this average representing total payroll increase may be open to question. There might be no change at all in wage rates, or even in total number of employees, but a shift in the distribution of personnel at the various rates would cause an increase or decrease in the total payroll.⁶ An analysis of each rate separately would probably be of greater value. However, the average of changes in total payrolls as shown in Table 41 is of practical value in calculating the effects of planned payroll changes for a given number of employees.

⁶ This question is developed in greater detail under standardized ratios in the next chapter, and weight bias is discussed in connection with index numbers in chapter XIX.

PROBLEMS

1. For each of the following pairs of items, compute the ratio and (1) state the relation in words; (2) give reasons for selecting the item you used as the base; (3) justify in terms of the text the number of units used in the base.

a) Total tonnage of steel produced in 1938, 18,692,937 gross tons; total tonnage of steel consumed by the automotive industry in 1938, 3,155,906 gross tons.

b) Number of commercial banks in the United States reporting retail installment paper in their portfolios as of Dec. 30, 1939, 10,382; amount of retail installment paper held, $541,367,000.

c) Bales of cotton produced in the United States in 1938, 12,008,000; bales of cotton produced in Brazil in 1938, 1,877,000.

d) The average weekly wage of steel workers in the United States was $35.90 in 1929 and $29.40 in 1939.

e) Population in United States registration area, 1930, 118,560,800; number of deaths from diphtheria, 1930, 5,822.

2. From any issue of the World Almanac select ratios illustrating each of the three sizes of base explained in the text. What is the relation between the numerator and the denominator of each ratio?

3. To what extent do the following examples conform to the rules stated in the text for ratios between unlike items?

a) In 1938 the sales per salesperson in 8 department stores located in cities with population less than 20,000 was $8,500; for 30 stores located in cities with population over 1,000,000 the corresponding figure was $18,000.

b) In 1935 the per capita sales of drugstores in Cleveland were 110 per cent of the sales in Detroit.

c) In 1938 the average amount of each dollar of revenue set aside by Class I railroads to pay taxes was 9 cents.

4. Given the following ratios:

a) Dollars paid by industrial consumers of electricity to the number of kilowatt hours consumed during a single year in a given industrial area.

b) Bank debits in New York City to bank debits in the rest of the United States during a given month.

c) Average number of active spindles in cotton textile manufacturing in the New England area this year compared with the corresponding figure for last year.

d) Time deposits of commercial banks to demand deposits of those banks as of a certain call date.

e) Sales of Chevrolet passenger automobiles to sales of Ford passenger automobiles last year in the state of California.

f) Imports of wheat into the United States from Canada to production of wheat in the United States for last year.

g) Number of deaths caused by industrial accidents in New York State in January, 1940, to the number in January, 1941.

Classify these ratios in three ways: (1) like items or unlike items, (2) time, space, or attribute, (3) part-to-total, part-to-part, or total-to-total.

5. From all the families with incomes of $1,000 and over in the following table, what per cent have no automobiles? Show method of computation.

SELECTED FAMILIES IN PORTLAND, OREGON, WITH INCOMES OF $1,000 AND OVER, 1933

| INCOME GROUP | NUMBER OF FAMILIES | PERCENTAGE NOT HAVING AN AUTOMOBILE |
|---|---|---|
| $1,000-1,499 | 1,426 | 34.2 |
| 1,500-1,999 | 1,068 | 25.0 |
| 2,000-2,999 | 701 | 16.7 |
| 3,000-4,999 | 300 | 13.0 |
| 5,000-6,999 | 45 | 2.2 |
| 7,000 and over | 27 | 11.1 |

6. Given the following information concerning deposits and depositors in mutual savings banks and postal savings in 1932 (000 omitted for all figures):

| STATES | MUTUAL SAVINGS BANKS: Deposits | MUTUAL SAVINGS BANKS: Depositors | POSTAL SAVINGS: Deposits | POSTAL SAVINGS: Depositors |
|---|---|---|---|---|
| United States | $10,040,000 | 12,700 | $783,000 | 1,540 |
| New York | 5,287,000 | 5,900 | 82,000 | 200 |
| Minnesota | 63,000 | 100 | 20,000 | 40 |

a) Compute whatever ratios you consider necessary to compare methods of saving from the data given.

b) Write a statement of your findings.

7. Each student will be given one assignment for each part of this problem. Answer either (a), (b), (c), or (d), throughout.

A. CIGARETTE CONSUMPTION IN THE UNITED STATES (Millions)

| MONTH | CONSUMPTION |
|---|---|
| July, 1933 | 9,526 |
| July, 1934 | 11,355 |
| July, 1935 | 13,138 |
| July, 1936 | 14,801 |
| July, 1937 | 15,290 |

Compared with the same month of the previous year, compute the percentage relation and percentage change. (a) 1934 with 1933; (b) 1935 with 1934; (c) 1936 with 1935; (d) 1937 with 1936.

B. The following data show the amount of retail trade (in millions of dollars) in Buffalo in 1935 for 8 lines of trade (column 9 being "all others") and the total:

| (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | Total |
|---|---|---|---|---|---|---|---|---|---|
| 54.9 | 37.8 | 22.9 | 26.0 | 8.6 | 6.8 | 17.3 | 6.3 | 24.8 | 205.4 |

Express the relation between the 2 items assigned to you, and the per cent which each is of the total retail sales. (a) Column 1, food, and column 7, eating and drinking places; (b) column 2, general merchandise, and column 8, drugstores; (c) column 3, apparel, and column 4, automotive; (d) column 5, furniture and household goods, and column 6, lumber, building, and hardware.

C. Number of automobile fatalities and millions of gallons of gasoline consumed by motor vehicles in four states, 1934:

| | N. Y. | TEXAS | IOWA | N. H. |
|---|---|---|---|---|
| No. of fatalities | 2,903 | 1,579 | 531 | 104 |
| Millions of gallons of gasoline | 1,501 | 875 | 404 | 71 |

Express the relation between gasoline consumption and deaths from automobile accidents in: (a) New York, (b) Texas, (c) Iowa, (d) New Hampshire.

8. Are there any of the ratios in the following that you cannot interpret? If so, explain what additional information is needed in order to draw valid conclusions from the ratios.

PRODUCTION OF MAPLE SUGAR AND SYRUP IN THREE LEADING STATES, 1937 AND 1938

| | AVERAGE TOTAL PRODUCT PER TREE: As Sugar (pounds), 1937 | As Sugar (pounds), 1938 | As Syrup (gallons), 1937 | As Syrup (gallons), 1938 | PERCENTAGE OF UNITED STATES PRODUCTION |
|---|---|---|---|---|---|
| Vermont | 1.5 | 2.3 | .19 | .29 | 37.9 |
| New York | 1.8 | 1.7 | .22 | .21 | 25.7 |
| Ohio | 2.7 | 1.9 | .34 | .24 | 15.3 |
| United States | 1.8 | 2.0 | .23 | .25 | 100.0 |
| Average price | .290 | .283 | 1.60 | 1.61 | |
| Percentage of 1925-29 average | 101.0 | 98.6 | 77.7 | 78.2 | |

9. The following is quoted from the report of a tobacco manufacturer to the stockholders: "Government figures, with our own figures, prove that our Company obtained during the first ten months of 1940 .... 59.93% of all the cigarette increase of the entire industry." What additional data would be needed in order to determine the importance of this report?

10. Locate each of the three sets of ratios of problem 7 according to Figures 33 and 34, and state which of the simple ratios are part-to-part, part-to-total, or total-to-total relations.

11. Describe two separate methods of computing the average per cent living on farms in Table 32, page 241. State exactly what data would be needed for each computation.

12. a) Compute the percentage change in average value per contract of non-residential building contracts, from the following data:

| | 1937 NUMBER | 1937 COST | 1938 NUMBER | 1938 COST |
|---|---|---|---|---|
| Non-residential building contracts | 124,305 | $564,961,000 | 116,993 | $567,069,000 |

b) With the following additional information discuss how much significance can be attached to your original ratio:

| TYPE OF NON-RESIDENTIAL CONSTRUCTION | 1937 NUMBER | 1937 COST | 1938 NUMBER | 1938 COST |
|---|---|---|---|---|
| Private garages and sheds | 96,514 | $27,423,000 | 91,147 | $23,798,000 |
| Other private construction | 26,711 | 413,072,000 | 24,497 | 402,762,000 |
| Public works, public buildings and utilities | 1,080 | 124,466,000 | 1,349 | 140,509,000 |

13. An industrial plant had the following number of employees at 2 different periods, with their respective total weekly payrolls.

| TYPE OF EMPLOYEES | 1925: Aver. Number Employed | 1925: Aver. Total Weekly Payroll | 1928: Aver. Number Employed | 1928: Aver. Total Weekly Payroll |
|---|---|---|---|---|
| Administrative and clerical | 194 | $9,409.00 | 156 | $7,566.00 |
| Skilled | 320 | 14,630.40 | 235 | 10,744.20 |
| Unskilled | 608 | 13,254.40 | 731 | 15,935.80 |
| Total | 1,122 | 37,293.80 | 1,122 | 34,246.00 |

A union of unskilled workers used the foregoing totals to prove that there had been a decrease in wages of 8.2 per cent. Discuss.

REFERENCES

BAILEY, WILLIAM B., and CUMMINGS, JOHN, Statistics. Chicago: A. C. McClurg & Co., 1917. Chapter VI lays down the pattern from which all subsequent work on ratios has been built.

BURGESS, ROBERT W., Introduction to the Mathematics of Statistics. Boston: Houghton Mifflin Co., 1927, chapter II.

JEROME, HARRY, Statistical Method. New York: Harper & Bros., 1924, chapter VIII.

CROXTON, FREDERICK E., and COWDEN, DUDLEY J., Applied General Statistics. New York: Prentice-Hall, Inc., 1939, chapter VII.

CROXTON, FREDERICK E., and COWDEN, DUDLEY J., Practical Business Statistics. New York: Prentice-Hall, Inc., 1934, chapter VII.

RIEGEL, ROBERT, Elements of Business Statistics. New York: D. Appleton-Century Co., 1927, chapter IX.
CHAPTER XII

APPLICATIONS OF RATIOS

The basic principles and uses of ratios were discussed in the preceding chapter. Some more complicated cases of ratio analysis omitted there are presented in this chapter. The subjects included fall into three groups: (1) refined ratios and their application in standardized form; (2) compound ratios and the conditions for their interpretation; (3) some types of ratios used in particular fields of business.

REFINED RATIOS

General

In a refined ratio special care is exercised to define both numerator and denominator so as to exclude whatever extraneous factors tend to obscure the direct relationship between the two items. The advantage of the refined ratio lies in the opportunity of selecting in the denominator the item or items that are directly related to the numerator. Thus in vital statistics the ratio of measles cases to population under 16 years of age conveys much more information than the ratio of measles cases to total population. In the latter case a decline in the ratio over a period of years might be the result of an increase in the number over 65 years of age in the population, whereas the incidence of measles among those likely to contract the disease may have been unchanged.

The ratio of labor cost in a factory to total cost of manufacture is a useful figure, but the denominator contains two kinds of cost, fixed cost and variable cost. The ratio of labor cost (a variable cost) to total variable cost gives a figure which is more valuable to management in analyzing operations. In the same way safety departments of manufacturing plants get an over-all accident rate for the entire plant by taking the ratio of employees injured to number employed. This figure is of value only as a summary. The danger of accidents varies from one department to another both in frequency and severity.
In a steel plant the tipping of a ladle in the furnace room is of infrequent occurrence but usually results fatally to workmen in the path of the hot metal. On the other hand men engaged in piling steel rods for storage may be involved in injuries rather frequently but the injuries are seldom fatal. Similar contrasts can be made between departments in any type of manufacturing operation. To take account of these variations safety men compute the ratio, accidents to employees, for each department separately.

Both the numerator and denominator of each departmental ratio are refined further in order to facilitate the study of accidents. The resulting ratio, known as the accident severity rate, is the number of days' work lost through accidents¹ divided by the number of equivalent full-time days worked.² The rate may be expressed in any unit of time, per week, per month, or per year.

The study of deaths in automobile accidents furnishes another example of the use of refined ratios. Columns 1 and 2 of Table 42 contain the number of persons killed in automobile accidents and the population of the United States yearly from 1930 to 1938. The number of persons killed in automobile accidents per million population is given in column 5. The steady rise of this ratio from 1930 to 1937, interrupted only in 1932 and 1933, is the basic fact of the so-called "automobile menace." The marked decline of the ratio in 1938 appears to be the first evidence of abatement of the "menace." That this conclusion is premature can be seen from further study of available information. The ratio of fatalities to population takes no account of the changed hazard resulting from increase in the number of automobiles on the highways. This factor is included in the ratio of fatalities to automobiles registered, shown in column 6.
This ratio fluctuates irregularly from 1930 through 1933, rises to a high point in 1934, and except in 1937 has declined since that time. The 1938 ratio of 109 fatalities per hundred thousand automobiles registered is the lowest recorded during the nine-year period. This ratio indicates that the increased death rate after 1934 was not proportionate to the increased hazard represented by the number of automobiles in use. In fact, the decline in fatalities per car since 1934 may be evidence of increased caution on the part of drivers.

¹ The number of days' work lost can be counted for temporary accidents but not for death, permanent disability, or permanent impairment. Even temporary disabilities such as the loss of a finger will lead to different numbers of days' work lost. Consequently standards have been established for each type of accident. Thus, according to one standard, 6,000 days are allowed for death, 4,000 days for loss of an arm, 1,200 days for loss of a thumb and one finger, etc. Through the use of these standards, accident severity can be measured as between departments of a plant, between plants or between industries, as well as for different time periods. United States Bureau of Labor Statistics Bulletin No. 234, The Safety Movement in the Iron and Steel Industry, p. 278.

² The number of equivalent full-time days worked is obtained by dividing the total number of man-hours worked during a given period by the standard working hours per day.

But there is another factor to be considered. Cars are being driven more miles in recent years; hence the exposure to accident is increased. The ratio, number of deaths per 100,000,000 gallons of gasoline consumed each year, as shown in column 7, allows for the increased mileage of cars. If the average number of miles obtained per gallon of gasoline had remained fixed, then the number of gallons of gasoline consumed would have a constant relation to the number of miles cars were driven.
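The three refined ratios discussed here can be recomputed from columns 1 to 4 of Table 42. A minimal sketch follows; the figures are those printed in the table, while the function and variable names are illustrative and not part of the original text.

```python
# Columns (1)-(4) of Table 42, in the units printed there:
# persons killed, population (thousands), motor vehicles (thousands),
# gasoline consumed (millions of gallons).
TABLE_42 = {
    1930: (32_540, 123_091, 26_545, 14_751),
    1931: (33_346, 124_113, 25_814, 15_408),
    1932: (29_196, 124_974, 24_115, 14_250),
    1933: (31_078, 125_770, 23_874, 14_224),
    1934: (35_769, 126_626, 24_952, 15_292),
    1935: (36_023, 127_521, 26_231, 16_264),
    1936: (37_500, 128_429, 28_166, 17_855),
    1937: (40_300, 129_257, 29_705, 19_218),
    1938: (32_000, 130_215, 29_486, 19_610),
}


def refined_ratios(killed, population_000, vehicles_000, gasoline_millions):
    """Return columns (5), (6), and (7) for one year, rounded to whole numbers."""
    # Deaths per thousand persons, scaled to deaths per million persons.
    per_million_population = killed / population_000 * 1_000
    # Deaths per thousand vehicles, scaled to deaths per hundred thousand vehicles.
    per_100k_vehicles = killed / vehicles_000 * 100
    # Deaths per million gallons, scaled to deaths per hundred million gallons.
    per_100m_gallons = killed / gasoline_millions * 100
    return (
        round(per_million_population),
        round(per_100k_vehicles),
        round(per_100m_gallons),
    )


ratios = {year: refined_ratios(*row) for year, row in TABLE_42.items()}

for year, (col5, col6, col7) in ratios.items():
    print(year, col5, col6, col7)
```

Each column divides the same numerator by a different denominator, so the three series can move differently: between 1934 and 1937 column 5 rises from 282 to 312 while column 7 falls from 234 to 210.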
The average number of miles per gallon has probably changed slightly during this nine-year period, but accurate information on this point is not available. Taking column 4 as a legitimate substitute for miles driven, we find that the ratio of deaths per 100,000,000 gallons of gasoline consumed has declined sharply from the high point reached in 1934. The ratios in columns 6 and 7 show that the increase in the automobile death rate through 1937 is due to the increased exposure of the population to the hazard of automobile accidents in spite of a decline since 1934 in deaths per motor vehicle registered and per gallon of gasoline consumed.

TABLE 42
FATALITIES IN AUTOMOBILE ACCIDENTS RELATED TO POPULATION, MOTOR VEHICLES REGISTERED, AND GASOLINE CONSUMED, 1930-38

| YEAR | (1)* PERSONS KILLED IN AUTOMOBILE ACCIDENTS | (2)† POPULATION (000 omitted) | (3)‡ NO. OF MOTOR VEHICLES LICENSED (000 omitted) | (4)‡ NO. OF GALLONS OF GASOLINE CONSUMED (000,000 omitted) | (5) FATALITIES PER MILLION POPULATION (1) ÷ (2) | (6) FATALITIES PER HUNDRED THOUSAND MOTOR VEHICLES REGISTERED (1) ÷ (3) | (7) FATALITIES PER 100 MILLION GALLONS OF GASOLINE CONSUMED (1) ÷ (4) |
|---|---|---|---|---|---|---|---|
| 1930 | 32,540 | 123,091 | 26,545 | 14,751 | 264 | 123 | 221 |
| 1931 | 33,346 | 124,113 | 25,814 | 15,408 | 269 | 129 | 216 |
| 1932 | 29,196 | 124,974 | 24,115 | 14,250 | 234 | 121 | 205 |
| 1933 | 31,078 | 125,770 | 23,874 | 14,224 | 247 | 130 | 218 |
| 1934 | 35,769 | 126,626 | 24,952 | 15,292 | 282 | 143 | 234 |
| 1935 | 36,023 | 127,521 | 26,231 | 16,264 | 282 | 137 | 221 |
| 1936 | 37,500 | 128,429 | 28,166 | 17,855 | 292 | 133 | 210 |
| 1937 | 40,300 | 129,257 | 29,705 | 19,218 | 312 | 136 | 210 |
| 1938 | 32,000 | 130,215 | 29,486 | 19,610 | 246 | 109 | 163 |

* Figures compiled and published by the Travelers Insurance Company, Hartford, Connecticut.
† Statistical Abstract, 1938, p. 10.
‡ Mimeographed releases of the United States Department of Agriculture, Bureau of Public Roads. Gasoline consumption is by motor vehicles only.

This conclusion does not in any way affect the desirability of a reduction in automobile fatalities. It does seem to indicate that, on the average, drivers maintain better control of cars in recent years and that further attempts to reduce the automobile accident toll will depend upon the compilation and study of more detailed information concerning the accidents that occur. This most likely means ratios subdivided for day and night accidents, accidents occurring on city streets and on open highways, accidents related to age of driver and age of motor vehicle, etc.

Standardized Ratios

Standardization of ratios consists in separating an over-all ratio into several mutually exclusive parts and computing a new combined ratio in the form of a weighted average of the several part ratios. The weights selected are a distribution of the denominators of the several part ratios according to some accepted standard, and these weights are held constant throughout any series of ratios that are being compared.

The use of standardized ratios originates in the field of vital statistics where standardized and corrected death rates, birth rates, etc., are employed in comparisons between different cities or sections of the country. The crude death rates of two cities may differ because of a difference in the age composition of the two populations although the death rate for each age group is identical in the two. The effect of variation in the age composition of different populations can be adjusted by the use of either standardized or corrected death rates.³

These methods can be transferred advantageously to the field of business ratios. In most cases the concept of the corrected rate rather than the standardized rate is applicable to business situations. What are known as "standardized ratios" in business applications are the equivalent of "corrected rates" in vital statistics.
The business usage is defensible because the expression "corrected rates" is likely to convey the impression that the unstandardized rates contain errors. Such is, of course, not the case, but the standardization leads to more precision in the interpretation of results.

⁸ Standardized and corrected rates are computed differently and lead to different results. Particular circumstances will determine which should be used in a given case. The details of both computation and interpretation are well presented in Raymond Pearl, Medical Biometry and Statistics (Philadelphia: W. B. Saunders Co., 1923), pp. 19S-207.

Department Store Example. The experience of a department store will serve to illustrate the standardization of ratios. The officials were using the amount of the average sales check in four selected departments combined as a quick evidence of changes in business conditions. The total dollar value of sales and the number of sales were subject to wide seasonal swings, but the ratio of the two (the amount of the average sales check) exhibited little seasonal influence. In the past this ratio had proved to be a very sensitive indicator of approaching business depression or recovery. Table 43, column 3, shows the ratio for August and September, 1936.

TABLE 43
SALES, NUMBER OF SALES CHECKS, AND AMOUNT OF AVERAGE SALES CHECK FOR A DEPARTMENT STORE IN AUGUST AND SEPTEMBER, 1936, FOUR SELECTED DEPARTMENTS COMBINED

| Month | (1) Sales | (2) Number of sales checks | (3) Amount of average sale |
|---|---|---|---|
| August | $48,102 | 8,013 | $6.00 |
| September | 45,530 | 7,660 | 5.94 |

The declining tendency of sales, number of sales, and the amount of the average sales check caused considerable consternation because everyone expected September results to be above the August level. When the figures for the entire store became available a sizable expansion of sales was shown as well as an increase in the amount of the average sales check.
The question then arose as to what had happened to the hitherto reliable preliminary indicator.

A study of the results by departments is given in Table 44. In Departments I and IV the percentage of decline in the number of sales checks exceeded the percentage of decline in sales, so that the amount of the average sales check increased. In Departments II and III the percentage of increase in number of sales checks was less than the percentage of increase in sales, which again resulted in an increase in the amount of the average check. It seemed strange then that the average check should have decreased for the four departments combined. The explanation lies in the fact that Departments II and III with small average sales checks had increased sales while Departments I and IV with somewhat larger sales checks showed great declines in sales. This combination of changes shifted the weights of the four departments so much that the combined ratio declined.

The change in the amount of the average sales check in Table 44 is dependent upon shifts in sales among the departments and changes in the average amount purchased per customer. Since the intention was to measure only the latter change, a means of eliminating the effect of the former had to be devised. This was done by setting up a standard distribution of sales checks among the four departments and computing the amount of the average sales check each month for this standard distribution.

TABLE 44
SALES, NUMBER OF SALES CHECKS, AND AMOUNT OF THE AVERAGE SALES CHECK FOR A DEPARTMENT STORE IN AUGUST AND SEPTEMBER, 1936, FOUR SELECTED DEPARTMENTS COMPUTED SEPARATELY

| Department | August: (1) Monthly sales | August: (2) No. of sales checks | August: (3) Amount of average sales check, (1) ÷ (2) | September: (4) Monthly sales | September: (5) No. of sales checks | September: (6) Amount of average sales check, (4) ÷ (5) |
|---|---|---|---|---|---|---|
| I | $10,416 | 1,010 | $10.31 | $4,293 | 380 | $11.30 |
| II | 9,622 | 1,595 | 6.03 | 11,889 | 1,862 | 6.39 |
| III | 21,840 | 4,621 | 4.73 | 26,400 | 5,112 | 5.16 |
| IV | 6,224 | 787 | 7.91 | 2,948 | 306 | 9.63 |
| Total or average | $48,102 | 8,013 | $6.00 | $45,530 | 7,660 | $5.94 |

TABLE 45
COMPUTATION OF THE AMOUNT OF THE AVERAGE SALES CHECK (STANDARDIZED) FOR A DEPARTMENT STORE IN AUGUST AND SEPTEMBER, 1936

| Department | Standard figures: (1) Standard distribution of sales checks | Standard figures: (2) Standard distribution of sales checks with total of one, (1) ÷ 8,182 | August: (3) Amount of average sales check | August: (4) August standardized sales, (3) × (2) | September: (5) Amount of average sales check | September: (6) September standardized sales, (5) × (2) |
|---|---|---|---|---|---|---|
| I | 845 | .10328 | $10.31 | 1.065 | $11.30 | 1.167 |
| II | 1,729 | .21132 | 6.03 | 1.274 | 6.39 | 1.350 |
| III | 4,615 | .56404 | 4.73 | 2.668 | 5.16 | 2.910 |
| IV | 993 | .12136 | 7.91 | .960 | 9.63 | 1.169 |
| Total or average | 8,182 | 1.00000 | | 5.97 | | 6.60 |

The standard distribution of sales checks was obtained by taking the average monthly number of checks in each department for the year 1935. The selection of this standard was more or less arbitrary, but it approximated the actual distribution of sales checks among the four departments. The computation of the standardized average sales check is shown in Table 45. The standardized distribution of sales checks is given in column 1. The reduction of these figures so that their sum is unity is shown in column 2. The computation of the amount of the standardized average sales check is shown in columns 3 and 4 for August and in columns 5 and 6 for September. The multiplications are indicated at the head of the columns. The average sales check increased from $5.97 in August to $6.60 in September after the effect of shifts in sales between departments had been adjusted.

There is some likelihood that the standard distribution of sales checks will lose its representativeness as time goes on.
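The arithmetic of Table 45 can be sketched in a few lines of Python. This is only an illustration: the variable names and the helper `standardized_average` are assumptions introduced here, while the figures are those of Tables 44 and 45.

```python
# Illustrative sketch of the Table 45 computation (all names are hypothetical).

# Standard distribution: average monthly number of sales checks per department in 1935.
standard_checks = {"I": 845, "II": 1729, "III": 4615, "IV": 993}

# Amount of the average sales check by department (Table 44, columns 3 and 6).
august = {"I": 10.31, "II": 6.03, "III": 4.73, "IV": 7.91}
september = {"I": 11.30, "II": 6.39, "III": 5.16, "IV": 9.63}


def standardized_average(avg_check, standard):
    """Weighted average of departmental checks, using fixed standard weights."""
    total = sum(standard.values())
    return sum(avg_check[d] * standard[d] / total for d in standard)


aug_std = standardized_average(august, standard_checks)
sep_std = standardized_average(september, standard_checks)
print(f"August:    ${aug_std:.2f}")
print(f"September: ${sep_std:.2f}")
```

Because the same weights are applied to both months, any difference between the two results reflects changes in the departmental averages only, not shifts in the share of checks written in each department.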
To avoid this contingency the distribution of sales checks for the most recent calendar year could be used as the standard. The results for long periods of time will not be comparable if a changing standard is used, but the purpose is merely to get a preliminary judgment from month to month; hence the long-run situation is unimportant.

Labor Turnover Example. Another point in our economic system at which standardized rates should be employed is in the measure of labor turnover. A measure commonly employed is known as the separation rate. It is the ratio of the number leaving employment during a period to the number on the payroll during the period. For example, the separation rate of a manufacturing plant would be measured as shown in Table 46.

TABLE 46
COMPUTATION OF MONTHLY LABOR TURNOVER (CRUDE SEPARATION RATE) OF A MANUFACTURING PLANT

| Month | (1) Number of employees on payroll at beginning of month | (2) Number of employees leaving employment during the month | (3) Crude separation rate (per cent), (2) ÷ (1) |
|---|---|---|---|
| March | 4,800 | 536 | 11.2 |
| November | 4,660 | 453 | 9.7 |

These figures show an appreciable decline in the separation rate from March to November. This is a crude rate which for some purposes may be satisfactory, but it would not be safe to conclude that there was greater labor stability in this plant in November. It is well known that out of a group of men hired at any time a certain number will dislike the work and leave within a few days; others will be found to be unsatisfactory and will be discharged after a brief trial. The remainder will continue working for longer periods, and those who stay as long as one or two months are likely to remain with the company for years. That is, the "employment mortality" decreases with increase in length of service. It would therefore be desirable to study the separation rate with the length of employment in the plant standardized. The method of doing this is shown in Table 47.
The stub shows the length of time employees had worked for the company as of March 1. Column 3 shows the number leaving employment during the month and column 4 shows the separation rate by length of time employed. These specific rates show a gradual decline as the length of time employed increases. The higher rate for those employed ten years or over reflects the separations due to retirement, disability, and death. The specific rates follow a course quite similar to that of the specific death rates of a population: that is, a high infant mortality rate, a gradual decline to past middle life, and then an increase at the upper ages.

Exactly what should be used as a standard would have to be determined for each plant separately.⁴ The use of the average distribution of employees for the last calendar year seemed best suited for the plant in question because it had a highly fluctuating labor force, varying between 2,500 and 6,000 within three years. The use of a five-year or even a three-year standard would have placed too much emphasis on a past situation and there was too much variation to attempt the selection of an arbitrary standard.

The standard distribution of employees in column 1 was obtained by taking the average of figures similar to those in column 2 for the twelve months of the preceding calendar year. These one-year averages are expressed in the form of decimals totaling unity following the method shown in Table 45, column 2. This form saves one step in the computation. If the actual average distribution of employees were used in column 1, the entries in column 5 would be the number of separations that would have occurred in the standard distribution of employees at the specific rates for the actual distribution of employees in March. The total of the standardized separations divided by the total of the standard distribution would give the same standardized separation rate that was obtained by the method in the table.
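A brief sketch may make the equivalence of the two procedures concrete. The figures below are hypothetical and are not those of Table 47; the four length-of-service groups, the specific rates, and all variable names are assumptions chosen only for illustration.

```python
# Hypothetical figures, NOT taken from Table 47.
# Standard distribution: average number of employees in each length-of-service group.
standard_counts = [400, 600, 1000, 2000]
# Specific separation rates observed during the month for the same groups.
specific_rates = [0.40, 0.20, 0.10, 0.05]

total_standard = sum(standard_counts)

# Method 1: standard distribution expressed as decimals totaling unity.
weights = [n / total_standard for n in standard_counts]
rate_from_weights = sum(w * r for w, r in zip(weights, specific_rates))

# Method 2: separations expected in the standard distribution,
# divided by the total of the standard distribution.
expected_separations = sum(n * r for n, r in zip(standard_counts, specific_rates))
rate_from_counts = expected_separations / total_standard

print(round(rate_from_weights, 4), round(rate_from_counts, 4))
```

Both methods give the same standardized rate; the first merely performs the division by the total before the multiplication instead of after it.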
However, using column 1 gives the standardized separation rate directly as the total (.1021) of column 5. The same computations give the standardized separation rate for November (.1026), as shown in columns 6 to 9. The summary at the bottom of the table shows that the decline in the crude separation rate from March to November was entirely due to change in length of employment. The standardized rate indicates that there was essentially no change in the forces of labor unrest leading to separations.⁵

⁴ If, however, a standardized rate were to be determined for several plants or an entire industry, the same standard would necessarily have to be used throughout.

[TABLE 47: computation of the standardized separation rate by length of employment, March and November; the table body is illegible in the source.]

This example and the one dealing with sales checks of a department store show the advantage of standardizing ratios in order to separate the significant causes of an observed change in the crude ratios. The method has not been used much in the past by business statisticians, but a wider application can be expected in the future to meet the demand for more exact interpretation of the ratios used to measure changes in business operations.

COMPOUND RATIOS

When a ratio between two ratios is computed the result is known as a compound ratio. Such ratios require careful interpretation, because the computation is of the form

$$\frac{\text{numerator}_1}{\text{denominator}_1} \div \frac{\text{numerator}_2}{\text{denominator}_2} = \text{compound ratio}$$

The result shown by the compound ratio may have arisen from changes in either or both numerators, either or both denominators, or from changes in the numerator and denominator of either or both ratios. With so many possible combinations of causes, it is evident that misinterpretation may easily occur. The conservative conclusion, therefore, would be to eliminate compound ratios entirely.
However, they have become an integral part of the business man's use of statistics; hence it is preferable to explain their valid use rather than to eliminate a technique which has considerable practical value.

Compound ratios can be divided into three groups: (1) those in which the denominators of the two simple ratios are stable; (2) those in which the simple ratio used as denominator is stable; and (3) those in which all of the constituents fluctuate. The last type of ratio can be used only as a general indicator of changes while the first two can be interpreted specifically. The distinction between the three in both form and interpretation can be seen in the examples that follow.

⁵ Current information concerning labor turnover in manufacturing plants is released in mimeographed form monthly by the United States Bureau of Labor Statistics. The reports include data for "quits," "discharges," "lay-offs," "accessions," separation rates, and turnover rates. This release does not make use of the refined rates presented in the text.

Stabilized Denominators

In this type, one ratio is divided by another whose denominator is not identical with that of the first ratio, but the difference is so slight that for practical purposes the compound ratio is valid. It is subject to the same interpretations and comparisons with other compound ratios in the same series that could be made if all the denominators were identical. For example, a study of changes in the loans and investments of member banks of the Federal Reserve System is presented in Table 48. The decrease in the proportion of assets invested in government securities can be seen in column 3, but column 4 provides additional information concerning the months in which the decline was most pronounced. It might appear that the same facts could be shown equally well by ratios between the successive numerators, in column 2.
Comparison of these simple ratios, listed in column 5, with the compound ratios in column 4 shows, however, that this is not the case. Column 5 indicates that the greatest decline in investments in government securities occurred from March to April, but column 4 indicates that the greatest decline in the ratio, investments in government securities to total loans and investments, occurred from February to March. Further, column 5 shows that investments in government securities increased from May to June, whereas column 4 shows that the ratio of investments in government securities to total loans and investments declined slightly from May to June, although the rate of decline was gradually tapering off. It is evident, therefore, that while both columns 4 and 5 provide valid comparisons the information contained in column 5 is in no sense a substitute for that contained in column 4.

TABLE 48
TOTAL LOANS AND INVESTMENTS IN GOVERNMENT SECURITIES OF REPORTING MEMBER BANKS OF THE FEDERAL RESERVE SYSTEM IN 101 LEADING CITIES, JANUARY-JUNE, 1937

| Month | (1) Total loans and investments* (000,000 omitted) | (2) Investments in government securities* (000,000 omitted) | (3) Government securities as a percentage of total loans and investments, (2) ÷ (1) | (4) Compound ratio: per cent that each month's ratio in column 3 is of the preceding month | (5) Simple ratio: per cent that each item in column 2 is of the preceding month |
|---|---|---|---|---|---|
| January | $22,734 | $10,493 | 46.2 | | |
| February | 22,600 | 10,330 | 45.7 | 45.7 ÷ 46.2 = 98.9 | 98.4 |
| March | 22,610 | 10,008 | 44.3 | 44.3 ÷ 45.7 = 96.9 | 96.9 |
| April | 22,280 | 9,628 | 43.2 | 43.2 ÷ 44.3 = 97.5 | 96.2 |
| May | 22,201 | 9,483 | 42.7 | 42.7 ÷ 43.2 = 98.8 | 98.5 |
| June | 22,330 | 9,515 | 42.6 | 42.6 ÷ 42.7 = 99.8 | 100.3 |

* Federal Reserve Bulletin, September, 1937, p. 924.

Denominator Ratio Stable

This case arises where one ratio or set of ratios is used as a standard to which another ratio or set of ratios is to be compared.
Trade and mercantile association secretaries frequently employ this form of ratio comparison in studying the distribution of costs of doing business of individual association members.

A wholesale grocers' association had 28 members. The members did not agree to divulge their actual sales or costs of operation, but each month they reported the per cent which each of the items listed in Table 49 was of the sales for the month. With the aid of an agreed-upon set of weights, proportionate to the sales of the individual members, the secretary of the association computed the average percentage distribution of the several items making up total sales. This distribution served as a standard to which each of the members could compare his own operations.

TABLE 49
PERCENTAGE DISTRIBUTION OF SALES OF WHOLESALE GROCERS INTO COSTS OF DOING BUSINESS AND PROFIT; CONCERN NUMBER 10 COMPARED WITH AVERAGE FOR 28 CONCERNS*

| Items | Percentage distribution of costs: (1) 28 member concerns | Percentage distribution of costs: (2) Member Concern No. 10 | (3) Percentage variation of Concern No. 10 from the average, (2) ÷ (1) − 100% |
|---|---|---|---|
| Administrative expense | 6.1 | 6.4 | +4.9 |
| Rent, interest, and insurance | 1.7 | 4.2 | +147.1 |
| Selling expense | 8.1 | 10.7 | +32.1 |
| Handling expense | 2.0 | 2.4 | +20.0 |
| Delivery expense | 4.5 | 6.1 | +35.6 |
| Cost of goods sold | 76.3 | 69.3 | −9.2 |
| Profit | 1.3 | .9 | −30.8 |
| Total | 100.0 | 100.0 | |

* Data taken from the files of the secretary of a wholesale grocers' association.

Columns 1 and 2 of the table give the average distribution and the distribution for one of the members, respectively. Comparison of the two columns shows the items for which the particular concern had either better or poorer than average results. However, the variation with regard to specific items is brought out more clearly by the compound ratios in column 3.
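The compound ratios of column 3 can be reproduced from columns 1 and 2 as follows. This is a minimal sketch; the dictionary layout and the function name `percentage_variation` are assumptions made for illustration, and only the percentages themselves come from Table 49.

```python
# Percentage distribution of sales from Table 49:
# (average of the 28 member concerns, Member Concern No. 10).
items = {
    "Administrative expense": (6.1, 6.4),
    "Rent, interest, and insurance": (1.7, 4.2),
    "Selling expense": (8.1, 10.7),
    "Handling expense": (2.0, 2.4),
    "Delivery expense": (4.5, 6.1),
    "Cost of goods sold": (76.3, 69.3),
    "Profit": (1.3, 0.9),
}


def percentage_variation(standard, concern):
    """Column 3: (2) divided by (1), expressed as a deviation from 100 per cent."""
    return (concern / standard - 1.0) * 100.0


variation = {name: percentage_variation(s, c) for name, (s, c) in items.items()}
for name, v in variation.items():
    print(f"{name:32s}{v:+7.1f}")
```

Each result compares two percentages of sales, so it describes a difference per dollar of sales rather than a difference in absolute dollars.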
These ratios must be interpreted not as showing the percentage difference in absolute dollars by which each item of Concern Number 10 differed from the average, but as showing the difference dollar for dollar in proportion to total sales. That is, the actual dollar profit of this concern was not 31 per cent less than the average profit of the 28 together; rather, for every dollar of profit that should have been made by Concern Number 10 according to the standard, only 69 cents was realized, despite the fact that the cost of goods sold was 9 cents on the dollar less than the average. Similarly, this concern's selling expense was 32 cents per dollar greater than the standard, the handling expense was 20 cents greater, and the delivery expense was 36 cents greater.

The fact that selling, handling, and delivery expense exceeded the average was no cause for alarm. Concern Number 10 did a large proportion of its total business in fresh fruits and vegetables. It used a unique system of selling. Trucks were sent daily through all of the territory served by the concern. On the theory that truck drivers are not good salesmen, a salesman who had nothing to do with handling goods rode on each truck. His duty was to sell goods which the truck driver immediately delivered. This system added to the concern's selling costs and its delivery costs. The high handling costs arose from the nature of the goods. That is, car-load shipments of oranges, grapes, tomatoes, and similar things required extra handling on account of spoilage, the need for protection from cold weather, and the necessity of moving the goods quickly, so that the higher operating costs were to be expected. The lower cost of goods was likewise clear enough because perishable goods must be marked up a greater percentage on cost to provide for waste, spoilage, and the additional cost of quick sales.

The small profit margin, however, was unsatisfactory and further investigation was undertaken.
The administrative expense was not much above the average but the investigation indicated that some economies might be made at that point. However, the great variation in the percentage of sales going for rent, interest, and taxes was a revelation to the owner. Further study showed that rent made up a great part of the total of this item. This raised the question whether the concern could reduce the floor space used. The rent paid for the combined office and warehouse building in which most of the business was conducted seemed to be quite low, but two auxiliary warehouses in which bulky commodities such as potatoes were stored had been leased at high rentals. By rearranging the use of space in the main warehouse and reducing the inventories of these bulky commodities, the concern was able to abandon the use of both auxiliary buildings when the leases expired. The charge for rent was thereby greatly reduced without any increase in other costs. The saving was therefore reflected directly in increased profits.

This example shows what can be done through the interpretation of compound ratios when one of the two sets of ratios is used as a standard. The members of the association were able, through the use of the association secretary's report, to compare their own operations with a standard established under similar conditions. Yet no member of the association had divulged any information which he desired to be confidential.

Fluctuating Numerators and Denominators

Table 50 shows the computation of the change in the ratio of accounts receivable to sales from August to September. The 25 per cent increase in the ratio of receivables to sales in September, computed as (50 ÷ 40) − 100% = +25%, cannot be given any specific interpretation. Any information concerning the reasons for the increase must be obtained from a study of the original data from which the simple ratios were computed.
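The point can be checked with the figures of Table 50 (accounts receivable of $20,000 and $30,000 against sales of $50,000 and $60,000). In the sketch below the variable names are assumptions made for illustration; it shows that the 25 per cent change in the ratio is simply the quotient of the separate changes in receivables and in sales.

```python
# Figures from Table 50.
receivables = {"August": 20_000, "September": 30_000}
sales = {"August": 50_000, "September": 60_000}

# Simple ratios: receivables as a per cent of sales in each month.
ratio = {m: 100 * receivables[m] / sales[m] for m in receivables}

# Compound ratio: September ratio relative to August ratio, as a percentage change.
compound_change = (ratio["September"] / ratio["August"] - 1) * 100

# The same change expressed through the separate movements of
# the numerator (receivables) and the denominator (sales).
receivables_growth = receivables["September"] / receivables["August"]  # 1.5
sales_growth = sales["September"] / sales["August"]                    # 1.2
implied_change = (receivables_growth / sales_growth - 1) * 100

print(ratio, round(compound_change, 1), round(implied_change, 1))
```

Other pairs of movements would give the same figure; for instance, unchanged receivables with sales falling by one-fifth also yields 1 ÷ 0.8 = 1.25. The compound ratio alone therefore cannot identify the cause of the change.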
Both the receivables and the sales increased, but the receivables increased 50 per cent while the sales increased only 20 per cent, thus accounting for the increase in the ratio. Further study would be needed to discover why the receivables increased so much. The indications from the figures are that all of the increase in sales was credit business. If such proved to be the case, management would have to consider the implications of a continuation of the same tendency in the future.

All of the interpretation of the table arose from study of the original data rather than the ratios. The ratio of receivables to sales would have indicated the change that took place just as well as the compound ratio. There is, consequently, no advantage in computing the compound ratio beyond that of further summarizing the information contained in the simple ratios.

TABLE 50
COMPUTATION OF PERCENTAGE CHANGE IN RATIO OF RECEIVABLES TO SALES FROM AUGUST TO SEPTEMBER

| Month | Accounts receivable | Sales | Ratio of receivables to sales (per cent) | Change from preceding month (per cent) |
|---|---|---|---|---|
| August | $20,000 | $50,000 | 40 | |
| September | 30,000 | 60,000 | 50 | +25 |

There are many applications of compound ratios in the analysis of business conditions. Those uses in which the denominators of both the simple ratios or both terms of the denominator ratio are stable will lead to results which can be interpreted rather exactly. On the other hand those cases in which the numerators and denominators of both ratios vary freely are more difficult to interpret. Because of this difficulty it is better to avoid the chance of misinterpretation by using such compound ratios merely as a guide to information which may be obtained from study of the simple ratios involved.

EXAMPLES OF THE USE OF RATIOS IN BUSINESS

In this chapter and the preceding one considerable emphasis has been placed upon the widespread use of ratio analysis of business data, and numerous illustrations have been presented.
Such usage ranges all the way from the simplest percentages to highly specialized ratios. Some of the latter are of particular interest because of the comprehensive analysis of business activity which results from their use. Four examples of specialized ratio analysis are presented here: (1) a railroad analysis; (2) a retail credit department analysis; (3) a department store analysis; and (4) a financial statement analysis.

Railroad Analysis

These ratios originate partly in the Interstate Commerce Commission and partly in the Association of American Railroads. They pertain to several phases of railroad operation as indicated in Table 51. A complete explanation of the ratios would require too much space in this book. Each ratio provides specific information concerning a certain phase of railroad operations. For example, row (c), the freight revenue per train-mile, and row (h), the passenger revenue per train-mile, contain ratios showing the income from each type of service per unit of transportation employed. The compound units, freight-train mile and passenger-train mile, are designed especially to measure the basic operation of railroading, the movement of a train over a mile of track.

TABLE 51
RATIOS USED IN THE ANALYSIS OF RAILROAD OPERATIONS IN THE UNITED STATES; DATA FOR SELECTED YEARS SHOW THE TREND OF OPERATIONS*

| Ratio | Unit | 1900 | 1910 | 1920 | 1930 | 1935 |
|---|---|---|---|---|---|---|
| Freight Operations | | | | | | |
| (a) Revenue freight per train-mile | ton | 271 | 380 | 639 | 699 | 646 |
| (b) Revenue ton-miles per mile of road | thousand ton-miles | 735 | 1,071 | 1,597 | 1,481 | 1,185 |
| (c) Revenue per train-mile | dollar | 2.00 | 2.86 | 6.81 | 7.56 | 6.51 |
| (d) Revenue per ton-mile | cent | .729 | .753 | 1.069 | 1.074 | 1.000 |
| Passenger Operations | | | | | | |
| (e) Average journey per passenger | mile | 27.8 | 33.5 | 37.3 | 38.0 | 41.3 |
| (f) Av. passengers per train-mile | person | 41 | 56 | 80 | 49 | 47 |
| (g) Revenue per passenger per mile | cent | 2.00 | 1.94 | 2.76 | 2.72 | 1.94 |
| (h) Revenue per train-mile | dollar | 1.01 | 1.30 | 2.78 | 1.85 | 1.35 |
| Finance | | | | | | |
| (i) Investment per mile of road | dollar | 61,490 | 64,382 | 81,954 | 105,661 | 106,339 |
| (j) Taxes per mile of road | dollar | 233 | 431 | 1,262 | 1,519 | 1,062 |
| (k) Operating income per mile of road | dollar | 7,722 | 11,866 | 24,361 | 20,564 | 14,483 |

* Compiled from reports of the Interstate Commerce Commission and the Association of American Railroads.

The difference between the trends of freight and passenger traffic can be seen by studying the figures for these two ratios in the table. Freight revenue per train-mile increased each decade through 1930 but by 1935 had declined to $6.51, a drop of 14 per cent from $7.56, the high value of 1930. On the other hand passenger revenue per train-mile declined from $2.78 in 1920 to $1.85 in 1930, a drop of 34 per cent. A further decrease brought the 1935 revenue to about the 1910 level. The great decline in passenger revenue per train-mile can be understood by reference to "average passengers per train-mile," row (f). In order to render passenger service, the railroads were forced to run their trains even though the number of passengers per train-mile in 1935 was little greater than the number carried in 1900. In contrast the number of tons of freight per train-mile, row (a), rose steadily to a high value in 1930 and was only slightly smaller in 1935. The ability of the railroads to control the size and the frequency of operating freight trains has resulted in a much smaller decline in revenue per freight-train-mile than that experienced per passenger-train-mile.

The ratios of Table 51 are all of the same type and can be catalogued as follows:

1. They are ratios between unlike items expressed in different units, including a number of compound units.
2. All of the ratios are refined to some extent in both numerator and denominator, e.g., "revenue ton-miles" excludes non-revenue freight and empty cars transported; "miles of road" is defined to exclude yard and terminal multiple trackage, all side tracks and auxiliary tracks, and duplicate main tracks (a four-track line 100 miles in length is counted as 100 miles of road).
3. No compound or standardized ratios are included.

Retail Credit Department Analysis

A set of ratios designed to measure the activities of a retail credit department was published at the University of Michigan.

This investigation was undertaken to learn and make generally available facts regarding the costs, problems, and performances of credit and accounts receivable departments in retail stores. From the outset, the study was planned so that as many as possible of the resulting facts would take the form of typical figures which could be used as standards for appraising performance in individual stores.⁶

The ratios which were developed in the study have been in general use since then. They were the following:

- Per cent of returns to net [credit] sales⁷
- Per cent of credit office payroll to net [credit] sales
- Per cent of accounts receivable office payroll to net [credit] sales
- Per cent of total payroll [credit department] to net [credit] sales
- Per cent of losses from bad debts to net [credit] sales
- Payroll cost per transaction in the accounts receivable office
- Number of transactions handled yearly per accounts receivable office employee
- Average yearly salary in the credit office, in the accounts receivable office, and in both offices combined
- Per cent of collections to first-of-the-month outstandings
- Per cent of credit sales to total store sales

These ratios are quite specialized but that is exactly what is needed in dealing with the peculiar type of work performed by the credit department. Their use enables credit managers to follow very closely the efficiency of collections. Where credit information including volume and character of operations of the stores in an area is pooled in the hands of an association, ratio analysis of credit conditions for the entire area becomes possible. There is tremendous advantage to the individual stores in doing this, as it provides a standard with which each can compare its own results.

⁶ Carl N. Schmalz, "Operating Statistics for the Credit and Accounts Receivable Departments of Retail Stores 1927," Michigan Business Studies, Vol. I, No. 6 (June, 1928). Bureau of Business Research, School of Business Administration, University of Michigan, Ann Arbor, Michigan.

⁷ Net credit sales include both charge accounts and installment accounts. In the study the information was obtained separately for the two types of accounts, so that the ratio analysis could be made for the two separately or combined.

Analysis of Department Store Operations

The Bureau of Business Research of the Harvard University Graduate School of Business Administration publishes an annual report analyzing the operations of department stores. The report is based on information supplied by some 600 department and specialty stores located in cities in all parts of the country. A list of the ratios used in analyzing the information is shown in Table 52.
TABLE 52
RATIOS PERTAINING TO OPERATIONS OF 55 DEPARTMENT STORES IN THE UNITED STATES HAVING SALES BETWEEN $2,000,000 AND $4,000,000 IN 1934*

| Ratio | Result obtained |
|---|---|
| Net gain: percentage of net sales | 2.4% |
| Net gain: percentage of net worth | 4.6% |
| Rate of stock turn (times a year): based on beginning and ending inventories | 4.4 |
| Rate of stock turn (times a year): based on monthly inventories | 3.8 |
| Returns and allowances: percentage of gross sales | 8.5% |
| Returns and allowances: percentage of net sales | 9.3% |
| Distribution of net sales: cash | 42.0% |
| Distribution of net sales: C.O.D. | 5.3% |
| Distribution of net sales: charge | 45.8% |
| Distribution of net sales: installment | 6.9% |
| Sales per square foot of total space | $13.70 |
| Real estate cost per square foot of total space | $0.69 |
| Sales per employee | $5,500 |
| Losses from bad debts (percentage of charge sales) | .95% |

* Selected from Tables 19 and 20 of Carl N. Schmalz, "Operating Results of Department and Specialty Stores in 1934," Bureau of Business Research Bulletin No. 96 (June, 1935), Boston: Graduate School of Business Administration, Harvard University, pp. 27 and 28.

The ratios are self-explanatory. They provide a variety of information concerning operations, all of which is essential to management. The changes in these ratios from year to year indicate trends in department-store operation. Likewise the figures for any year serve as a standard to which an individual store can compare its results. For example, a store of comparable size finding that its stock turnover was less than 1.0 per year would want to take steps to find what types of merchandise were moving slowly and whether it would be feasible to reduce inventory or expand sales, or both. A store with 3 to 4 per cent bad debt losses would want to investigate conditions in the credit and accounts receivable departments. Similar variations in any of the ratios would lead management to investigate. Weak places in the organization can frequently be discovered and corrected through this type of analysis and comparison.
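A comparison of this kind can be sketched as follows. The standard figures are taken from Table 52; the results for the individual store are invented for illustration, and the 10 per cent tolerance, like the function and variable names, is an arbitrary assumption rather than a rule given in the text.

```python
# Standard ratios from Table 52 (55 stores, sales of $2,000,000 to $4,000,000, 1934).
standard = {
    "net gain, % of net sales": 2.4,
    "stock turn, monthly inventories": 3.8,
    "returns and allowances, % of gross sales": 8.5,
    "sales per employee, $": 5500.0,
}

# Hypothetical results for one store; these figures are invented for illustration.
store = {
    "net gain, % of net sales": 1.8,
    "stock turn, monthly inventories": 2.9,
    "returns and allowances, % of gross sales": 8.7,
    "sales per employee, $": 5600.0,
}


def relative_to_standard(store_ratios, standard_ratios):
    """Each store ratio expressed as a per cent of the corresponding standard."""
    return {k: 100.0 * store_ratios[k] / standard_ratios[k] for k in standard_ratios}


relatives = relative_to_standard(store, standard)

# Flag ratios that differ from the standard by more than an arbitrary 10 per cent.
flagged = [k for k, v in relatives.items() if abs(v - 100.0) > 10.0]
for k in flagged:
    print(f"{k}: {relatives[k]:.0f}% of standard")
```

The flagged items only indicate where management might look first; whether a deviation is favorable or unfavorable depends on the ratio concerned.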
Analysis of Financial Statements

A method of rating a borrower as a credit risk has been developed by Alexander Wall. 8 By the use of seven ratios a numerical basis is provided whereby bank executives are materially aided in determining the lines of credit that can be granted to customers. For a detailed explanation of the use of these ratios it would be best to read the reference given. We are interested mainly in the statistical application of the ratios involved, which can be explained most readily with the aid of an example.

Table 53 shows the balance sheet and annual sales of a concern for a five-year period. From these annual reports the seven ratios of Table 54 can be computed for each year and for the five years combined. The exact source of the numerator and denominator of each ratio is indicated by the letters in parentheses. Thus the ratio of net worth to debt for the first year (n ÷ m) is obtained as follows: $492,105 ÷ $156,094 = 3.15, or 315 per cent, and all of the other ratios are obtained by similar computations.

Some of these ratios have rather high values, and in general very careful study is required to judge the concern's standing as set forth in Table 54. The next step, therefore, is to refer the individual ratios to a standard. The standard chosen is the set of average ratios for the five-year period shown in column 6 of Table 54. The compound ratios in columns A to E of Table 55 are obtained by dividing each of the ratios in columns 1 to 5 of Table 54 by the standard ratios in column 6 of Table 54.

These standardized ratios could then be averaged to obtain a credit index. But they are not all of equal importance. The next step, therefore, is to establish a set of weights that can be used to give the several ratios their proper importance in determining the credit index.

8 Alexander Wall and Raymond W. Duning, Ratio Analysis of Financial Statements. New York: Harper & Bros., 1928.
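The arithmetic for one of the seven ratios can be sketched as follows. The net worth and debt figures are items (n) and (m) of Table 53. Two points are assumptions made here rather than statements from the source: the five-year standard is taken as the ratio of the five-year totals, and the compound ratios and weights passed to credit_index are hypothetical values chosen only to show the form of the weighted average, not Mr. Wall's actual weights.

```python
# Items (n) and (m) of Table 53, first year through fifth year.
net_worth = [492_105, 469_685, 474_745, 454_175, 444_402]
total_debt = [156_094, 157_357, 227_817, 324_920, 292_520]

# Ratio of net worth to debt (n / m) for each year.
yearly_ratios = [n / m for n, m in zip(net_worth, total_debt)]

# Assumed standard: the same ratio computed from the five-year totals.
standard = sum(net_worth) / sum(total_debt)

# Compound ratios: each year's ratio as a per cent of the standard.
compound = [100 * r / standard for r in yearly_ratios]


def credit_index(compound_ratios, weights):
    """Weighted average of compound ratios (weights need not sum to 100)."""
    total = sum(c * w for c, w in zip(compound_ratios, weights))
    return total / sum(weights)


for year, (r, c) in enumerate(zip(yearly_ratios, compound), start=1):
    print(f"year {year}: ratio {r:.2f}, {c:.0f}% of standard")
print(f"five years combined: {standard:.2f}")

# Hypothetical compound ratios and weights for three ratios only.
example_index = credit_index([156.0, 92.0, 110.0], [25, 10, 15])
print(f"illustrative weighted index: {example_index:.1f}")
```

The first-year ratio agrees with the 3.15 worked out in the text; the same two steps (ratio, then ratio divided by standard) apply to each of the other six ratios.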
TABLE 53
ASSETS, LIABILITIES AND SALES OF A HYPOTHETICAL CONCERN ANNUALLY FOR A FIVE-YEAR PERIOD*

DATE                          FIRST YEAR SECOND YEAR  THIRD YEAR FOURTH YEAR  FIFTH YEAR  TOTAL FOR FIVE YEARS
ASSETS
(a) Cash                          36,285      34,479      27,776      37,206      51,157               186,903
(b) Receivables                  229,437     208,363     220,666     231,171     233,540             1,123,177
(c) Inventory                    236,586     208,376     245,367     265,304     255,004             1,210,637
(d) Total current                502,308     451,218     493,809     533,681     539,701             2,520,717
(e) Plant and equipment          115,389     146,884     169,045     170,195     132,037               733,550
(f) Miscellaneous                 30,502      28,940      39,708      75,219      65,184               239,553
(g) Total fixed                  145,891     175,824     208,753     245,414     197,221               973,103
(h) Total                        648,199     627,042     702,562     779,095     736,922             3,493,820
LIABILITIES
(i) Payables                     139,894     152,455     220,539     308,880     255,459             1,077,227
(j) Taxes                         12,030         719       4,083       5,779       7,218                29,829
(k) Miscellaneous                  4,170       4,183       3,195      10,261      29,843                51,652
(l) Total current                156,094     157,357     227,817     324,920     292,520             1,158,708
(m) Total debt                   156,094     157,357     227,817     324,920     292,520             1,158,708
(n) Net worth                    492,105     469,685     474,745     454,175     444,402             2,335,112
(o) Total                        648,199     627,042     702,562     779,095     736,922             3,493,820
(p) Sales                      2,590,000   2,590,000   2,660,910   3,068,720   3,262,808            14,172,438

* Wall and Duning, op. cit., p. 297.

These weights are shown in column F of Table 55. They depend mainly on the accumulated judgment of Mr. Wall and his associates. 9 The final step in the computation of the index is shown in columns G to K of Table 55. The weighted combined result appears in index form at the foot of columns G to K, Table 55. The concern shows a strong credit position in the first year, is somewhat weaker the second year, but still has a high index. Decided weakness appears in the last three years, but the last year demonstrates marked recuperative powers in the concern as compared with the fourth year. 10

9 The complete argument for the use of weights and the reasons for the particular set selected are given in the source, pp. 157-62.

10 Those interested in the interpretation of the credit index should read from the source, pp. 299-306.

[TABLE 54: the seven ratios for each of the five years (columns 1 to 5) and for the five years combined (column 6). The figures are illegible in this copy.]

[TABLE 55: compound ratios (columns A to E), weights (column F), and weighted ratios with the credit index at the foot (columns G to K). The figures are illegible in this copy.]

Summary

Many additional examples of the application of ratios in the analysis of business operations might be cited. Those presented here have been chosen mainly to demonstrate the variety of practical uses of ratios.
The reader will also note the extent to which the ratios are specialized to meet the needs of the particular type of business to which they are applied. These ratios have been developed as the result of long experience and careful study. They are powerful tools of analysis in the hands of skilled investigators. Their use by less well-trained persons, who are unable to adhere to the basic principles of ratios which underlie all of the more complicated applications, may easily lead to gross error in the interpretation of results.

All of which points to a final observation. Ratios are perhaps the most widely used statistical technique, but it is also true that no other technique produces an equal amount of misuse in practice.

PROBLEMS

1. What refinement would you recommend in the denominator of each of these ratios?
   a) Employees killed in train accidents to total number of employees of railroads.
   b) The number unemployed in a community to the number of persons in the community.
   c) The ratio of investments of banks to loans and investments to show the tendency over a period of years toward increased importance of investments in bank portfolios.

2. The following data were compiled from reports of the United States Bureau of Census and the American Medical Association:

   STATE            PHYSICIANS PER 1,000,000     DEATH RATE PER 1,000 POPULATION
                    POPULATION IN THE U. S.      IN THE REGISTRATION AREA (1927)
   California               1,781                          13.9
   New York                 1,669                          12.3
   Massachusetts            1,552                          11.6
   Maryland                 1,501                          13.2
   Illinois                 1,492                          11.4
   Pennsylvania             1,251                          11.4
   New Jersey               1,078                          11.2
   Wisconsin                1,056                          10.1

   a) Why is the ratio of physicians to population highest in those states in which the death rate is highest?
   b) From the figures given compute the deaths per physician in each state. Is it true that the deaths per physician are highest where the ratio of physicians to population is lowest? Explain the result.

3. The following data* appear to indicate that the rate at which workers were re-employed in private industry from Relief and W.P.A. rolls was lower at the peak of industrial expansion in 1937 than in 1935.

   Column (1): Number of Unemployed Workers on Relief and W.P.A.
   Column (2): Number of Workers Leaving Relief and W.P.A. Because of Private Employment

   DURATION OF UNEMPLOYMENT       MARCH, 1935         MARCH, 1937
   (number of months)             (1)      (2)        (1)      (2)
   Less than 2                    110       15        104       12
   2 to 4                         152       23        131       16
   4 to 6                         204       20        147       14
   6 to 9                         219        9        222       10
   9 to 12                        255        7        260        9
   12 to 18                       392        7        415        8
   18 to 24                       477        7        573       10
   24 to 36                       515        5        521        6
   36 to 60                       404        4        397        4
   60 and over                    651        5        583        5
   Total                        3,379      102      3,353       94

   Over-all rate: March, 1935: 102 ÷ 3,379 = 3.02%
                  March, 1937:  94 ÷ 3,353 = 2.80%

   * The figures are the result of making necessary adjustments to insure comparability in the records available in a city which had about 28,000 gainfully employed according to the 1930 census.

   a) Compute the re-employment rate according to duration of unemployment for each period.
   b) Do the results confirm the decline shown by the over-all rate? If not, explain the discrepancy and devise a plan for the construction of comparable over-all rates.

4. State the three types of relation between the four elements of a compound ratio.

5. Given the following data concerning population and number of persons paying income tax in the United States:

                                                                                   1929           1936
   1. Population (estimated)                                                121,945,000    128,883,000
   2. Number filing income tax returns                                        4,044,327      5,413,499
   3. Percentage of population filing income tax returns                           3.32           4.20
   4. Percentage increase in number filing returns in 1936                                      +26.51
   5. Number of income tax returns by persons with net income over $5,000     1,032,071        677,011
   6. Percentage of those filing returns who had income over $5,000               25.52          12.51
   7. Percentage decrease in 1936 in number filing returns who had
      income over $5,000                                                                         50.98

   a) Are the per cents in rows 4 and 7 equally valid? Why or why not?
   b) Do these two per cents show that income has increased in the middle brackets at the expense of larger incomes, i.e., that wealth is gradually becoming more evenly distributed? Discuss.

6. What basic error would be involved in averaging the ratios in column 3 of Table 49 of the text?

7. The following data are computed from annual reports of the United States Steel Corporation:

                                        1929       1939
   Output per man-hour                 $1.80      $2.48
   Average wage per man-hour            .674       .897

   Labor leaders argue from these ratios that labor has not received a fair share of its increased productivity. What further evidence should be introduced before reaching a conclusion on this point?

8. Which of the credit department ratios on page 290 are favorable when they increase? Which are favorable when they decrease? Explain in each case.

9. Compute from Table 52 the change in net gain as a percentage of net sales, if all losses from bad debts were eliminated.

10. a) Explain the two methods of measuring inventory used in obtaining the stock turnover in Table 52.
    b) Which of the two is more exact? Why?

11. The heading of the stub of Table 55 of the text is "Compound Ratios." Which of the three types of compound ratios are these?

REFERENCES

BURGESS, ROBERT W., Mathematics of Statistics. Boston: Houghton Mifflin Co., 1927.
    Chapter III contains an explanation of the effect on the resulting ratios when weights are shifted.

RHODES, E. C., Elementary Statistical Methods. London: George Routledge and Sons, Ltd., 1933.
    Chapter 5 gives an excellent exposition of the use of refined ratios.

WALL, ALEXANDER, and DUNING, RAYMOND W., Ratio Analysis of Financial Statements. New York: Harper & Bros., 1928.
    Chapters V-XII deal specifically with ratio analysis, although the entire book should be read to understand the system of analyzing financial statements.

CHAPTER XIII

GRAPHS

INTRODUCTION

ANY representation of statistical values and relationships in pictorial or diagrammatic form is called a statistical graph. There are many forms in which such graphs are commonly used, and new and ingenious devices are constantly being worked out for the graphic presentation of statistical data.

Reasons for Using Statistical Graphs

Graphic methods have been developed to meet the needs of two major classes of users, the lay reader and the statistician.

For the Lay Reader. It is a well-recognized psychological principle that a direct visual concept such as color or size can be more quickly grasped and more readily understood than a set of numbers and table headings. A statistical table must first be read and then translated mentally into the actual concept of dollars of wages, bushels of wheat, etc. Although this is a process that is no more difficult than the reading of any other kind of printed material, the average reader is so frightened at the mere sight of a set of figures that he is likely to shy away from any table without even trying to see what it is about. In order to lure the attention of such readers, attractive graphs are an almost indispensable accompaniment to any popular exposition of statistical material.

For the Statistician. Statisticians have also discovered that appropriate graphic methods have sometimes clarified relations that remained obscure even after careful study of the numerical data. Graphs of analysis have therefore become a tool of the statistician for his own benefit as well as a medium for explaining his final results to others.

Purposes Served by Graphic Methods

Standard methods can best be analyzed by reference to some examples of a few well-known kinds of statistical graphs. Four of these basic types are illustrated in Figure 35.
FIGURE 35
TYPES OF GRAPHS
A. Population of Three Leading Cities on the West Coast, 1930 (bar graph: Los Angeles, San Francisco, Seattle; scale in thousands, 0 to 1,200)
B. Sources of Cash Farm Income, United States, 1938 (millions of dollars) (circle divided into sectors)
C. Production of Steel Ingots and Castings by United States and Rest of the World, 1939 (two squares: U. S., 32 million tons; rest of world, 68 million tons)
D. Air Forces of Leading Countries, September 1940 (including training ships) (rows of symbols for Germany, Gr. Britain, Russia, Italy, Japan, U. S. A.; each symbol equals 2,000 planes)

The first, A, is a simple bar graph in which the length of a bar represents the population in each of the three cities. In B, the total cash farm income in the United States is represented by the complete area of a circle, and the proportion derived from each source of income is indicated by a sector of the circle, each sector shaded to distinguish it from the others. Two squares are shown in C, the areas of which indicate the number of tons of steel produced by the United States and by the rest of the world in a given year. A number of equal-sized symbols in D indicates the approximate strength of the air forces of several countries in 1940. From these examples the first statistical purpose of graphic methods can be observed.

To Show Numerical Values and Relationships. In place of the actual figures that appear in the cells of a table, numerical values are represented by diagrams. These may consist of geometric forms such as the length of a line, the number of degrees in the sector of a circle, a square or other area, or they may contain a small number of symbols that scarcely need to be counted. The numerical relationships between these values also can be grasped instantly without the necessity of reducing them to the form of ratios or other statistical measures.
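The geometry behind the two squares of Figure 35-C can be checked with a short computation. It is only a sketch: the tonnages are those printed in the figure, while the drawing scale is a hypothetical value introduced here. When areas are proportional to the values, the sides are proportional to their square roots, so the sides differ much less than the quantities they represent.

```python
import math

# Tonnage shown in Figure 35-C (millions of tons).
production = {"United States": 32, "Rest of world": 68}

# Hypothetical drawing scale: inches of side per square root of a million tons.
SCALE = 0.25

# Area proportional to value means side proportional to the square root.
sides = {name: SCALE * math.sqrt(tons) for name, tons in production.items()}

value_ratio = production["Rest of world"] / production["United States"]
side_ratio = sides["Rest of world"] / sides["United States"]

print(f"ratio of values (and of areas): {value_ratio:.3f}")
print(f"ratio of sides:                 {side_ratio:.3f}")
```

The larger square stands for more than twice the tonnage, yet its side is less than one and one-half times as long, which suggests why a reader judging by eye may underestimate the difference.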
Referring again to Figure 35: In A the eye unconsciously estimates the lengths of the three bars, so that it is not even necessary to read the scale to perceive the relation of the populations of the three cities. In B the order of importance of the various sources of cash farm income and the proportion of each to the total can be seen without measurement. Likewise in D, the relative strength of the air forces of the several countries can be estimated by comparing the lengths of the several rows of symbols, without either counting the symbols or knowing how many units each symbol stands for. It is not so easy to compare the size of the two squares in C, and for this reason comparison by means of areas 1 is considered less effective than linear comparisons.

1 Perspective drawings of three-dimensional objects in which values are represented by volumes are still more difficult to evaluate.

The simple types of graphic form in these illustrations show nothing except numerical values and relationships. No attempt has been made to introduce other characteristics of the data such as the spatial relationships between the cities in Part A, between the United States and the rest of the world in C, or between the various countries in D. However, the pictorial representation of non-numerical relationships such as these is the second major purpose of some statistical graphs.

To Show Non-numerical Relationships. By the use of more complete and specialized types of graphs, time, space, or qualitative attributes can be presented in addition to numerical values. This is true in spite of the fact that the number of possible methods for translating data into diagrammatic form is essentially limited to the few that have already been illustrated.
There are not more than half a dozen methods altogether that can be used on a plane surface such as a page or a wall chart, but the number of ways in which dots, lines, areas, etc., may be combined in different arrangements has led to a seemingly endless variety of graphic types.

Every statistical table, except those in which the grouping is purely quantitative, contains in the stub or column headings a classification according to one of the non-numerical characteristics. In order to emphasize relationships between classes more realistically than is possible in tabular form, the data can be presented in certain specialized kinds of graphs.

Space: An alphabetical list of states conveys no picture whatever of actual geographical relationships; for instance, that Missouri is just across the river from Illinois; that Idaho, Nevada, and Utah form a group of mountain states; and that all the largest cities on the eastern seaboard fall within a radius of a few hundred miles. A statistical map in which certain numerical values pertaining to each state are depicted in their actual geographical arrangement lends itself to much more penetrating analysis of spatial relations.

Time: A column of dates may indicate successive years, while the adjacent column shows the volume of production in each year. This tabular form of presentation gives no such vivid concept of actual time movement and of the accompanying growth or decline in production as is afforded by a line whose movement across a graph one can follow along with the passage of the years.

Attributes: Qualitative attributes are difficult to represent graphically except by symbols. If black bars and shaded bars are used to represent male and female respectively, one has to read the key in order to have the slightest inkling of what the bars stand for. Rows of different symbols may be used instead of bars, as in Figure 35-D. This is one-third of the original graph.
The middle section contained rows of warships representing the sea power of the six countries and the lower section contained rows of soldiers representing the armed forces of the six countries. In the three parts of the graph, therefore, unlike units rather than different attributes were being distinguished, but the same idea can be carried over to symbols representing attributes such as urban and rural, white and Negro, etc.

Construction of Graphs

Tests of a Good Graph. Regardless of the particular purpose, every graph can be tested by one general criterion: The method of graphic construction is good if it produces an effective picture of important relationships and gives a true representation of statistical facts. Conversely, a method that results in an ineffective, uninteresting graph, or that distorts statistical facts, is bad.

Steps. The steps in making a graph are the same as those followed unconsciously by a reader of the graph, but in reverse order. The reader's attention is first caught by the effectiveness, or attractiveness, of a graph. If his interest is aroused, he then studies the graph and notes what information is presented through the various devices that have been used. Without obvious effort on his part, the facts and relationships which the graph was intended to emphasize will become clearly impressed upon his mind.

Interpret the data: For the statistician who plans and constructs the graph, however, this process of interpretation which is so easy for the reader becomes the first and most important problem. Before planning to illustrate any given set of data by means of a graph he must decide what relationship he intends to emphasize. He may wish to compare percentage changes in each of several commodities over a period of years, or their values in absolute amounts, or their percentage relation, either to each other or as parts of a total value in each year.
The example of various ratio comparisons from a single set of primary data shown in Table 39, page 262, and the accompanying interpretations illustrate one of the possible initial steps of analysis that the statistician takes in extracting significant information from the data.

Choose the best graphic method: His next step is to determine which type of graph is best suited for presenting the relation he has determined is most important. There is often more than one way of picturing a statistical relationship, whether numerical or non-numerical, but years of testing have put the stamp of approval on certain methods as particularly effective for each purpose. These will be the subject of the balance of this chapter and the major part of chapters XIV and XV.

Draw the graph effectively: The final step is to plan the actual arrangement and draw the graph. The factors that contribute to artistic effect and accuracy are more or less common to all kinds of graphs, and will be considered in chapter XIV after the detailed discussion of the simple types of graphs and some of the more complex graphs. It will be assumed throughout that the graphs are being prepared either for the business man who is not a trained statistician or for the general public. Graphs drawn for the use of statisticians should follow practically the same principles, although with greater emphasis on accuracy of detail and less on pictorial effect.

SIMPLE TYPES OF GRAPHS
METHODS AND PURPOSES OF EACH

In the ensuing discussion there is no intention of describing or even naming every kind of graph that has ever been used or that could be used. Only the commonly accepted types that have proved most successful in the presentation of statistical material 2 will be described in detail, and the major emphasis will be placed on the special purposes for which each is particularly well adapted.
Maps

The most obvious method for picturing spatial relationships is the map; it is more truly an actual "picture" of the facts than any other type of graph. There are many types of maps, ranging from the ordinary outline map which shows only geographical boundaries of land and water to so-called "distortion" maps at the other extreme. It is not the purpose of this section to describe all the possible maps that may be useful for various purposes, but only to point out the necessary features of maps that are definitely statistical. A map will be considered statistical only when quantitative relationships are represented by some pictorial device instead of in numerical form.

Location Maps. Outline maps form the background of statistical maps, and are also frequently used in business for non-statistical purposes. For example, a sales manager interested in knowing where each of his salesmen is working from day to day may move colored pins about on an outline map of the territory. But this is not a statistical map because no numerical information is involved.

Some location maps indicate non-numerical differences, such as states having certain types of liquor control or those in which a certain gasoline corporation has retail outlets. Different colors or shades of cross-hatching are used in such maps, but unless the differences are in some way quantitative the map is not statistical, and its construction need not be governed by the rules given later for cross-hatched ratio maps.

2 Business men use many types of diagrammatic representation that are adaptations of standard graphic methods. A variety of these, such as organization charts and Gantt charts, are described in books on applied graphics. Space limitations have prevented the inclusion of a comprehensive presentation of these usages in a textbook on statistical method, although a few specialized forms appear in chapter XXVI.
Many other location maps are found in print that present numerical information but are nevertheless not statistical. This is true of any map in which the quantities, ratios, etc., are simply inserted in each state or other subdivision in figures instead of being represented by shading or some other pictorial device. For most purposes such a map is harder to read than a table of the data, and no total visual impression is given of the spatial distribution of the numerical values.

A map in which evenly dotted areas signify the principal oil fields in the United States is also merely a location map. However, if each dot should be located at a point where there was production of 10,000 barrels of oil during a given period, the map would become statistical.

Dot Maps. To show density: Such a map as the one last described is called a small-dot or point-dot map. The use of this type is illustrated in Figure 36-A, which shows the location of filling stations in the United States in 1935. Each dot represents 20 filling stations in a county, and, since the information was available, they have been clustered within actual county limits, even though county boundaries are not shown on the map. In some states the dots are so close together that it is impossible to count them, but a clear impression is gained of the general distribution of filling stations throughout the United States, that is, of the relative density of filling stations in each state and in various sections of the country. If too large an amount is represented by each dot, they will be so scattered that no great density is apparent in any subdivision; on the other hand, if the unit is too small, certain areas become entirely black. There is no intention that the dots should be countable in any subdivision, but the unit must be so chosen that the effect of density will be clearly brought out.
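The effect of the unit chosen for a point-dot map can be sketched numerically. The unit of 20 stations per dot comes from Figure 36-A; the county names and counts, the alternative units, and the rule of rounding to the nearest whole dot are hypothetical assumptions made for illustration.

```python
def point_dots(count, unit=20):
    """Number of dots for one county, rounded to the nearest whole dot."""
    return int(count / unit + 0.5)


# Hypothetical filling-station counts for three counties.
counties = {"County A": 12, "County B": 95, "County C": 1_840}

# Compare a small unit, the unit of Figure 36-A, and a large unit.
for unit in (5, 20, 500):
    dots = {name: point_dots(n, unit) for name, n in counties.items()}
    print(f"one dot = {unit:>3} stations: {dots}")
```

With the small unit the densest county needs several hundred dots and would print nearly solid black; with the large unit the two smaller counties receive no dot at all, so the impression of density is lost.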
To show quantity: If the actual totals in round numbers are wanted, large dots are used instead of small dots, as illustrated in Figure 36-B. A certain effect of density is also provided by this map, but the dots are grouped in blocks by rows and columns, to facilitate counting, instead of being distributed throughout each state. The unit represented by each large dot is so chosen that no area will contain too many dots to count easily, and none will be too crowded to contain all the dots it should have. If dots must be drawn in the Atlantic Ocean to represent Rhode Island, Connecticut, Delaware, and other small states, the true spatial relationship is distorted.

FIGURE 36
FILLING STATIONS IN THE UNITED STATES, 1935
A. Point-Dot Map Showing Density (one dot represents twenty filling stations in a county). Reproduced from Market Research Series, No. 18, United States Department of Commerce.
B. Large-Dot Map Showing Number (whole dot: 500 filling stations; half dot: 250 filling stations). Data from Census of Business, 1935.

It should be noted that, except for the last dot in each state, each dot represents an exact amount. In this example, each solid dot stands for exactly 500 filling stations, except that the last solid dot stands for the round number 375-625 when no half dot has been added. When a half dot has been added, each whole dot represents exactly 500, but the half dot may represent any number from 125-375. There will never be more than one partial dot in any given area. Partial dots are sometimes subdivided into quarters; i.e., 3/4 black indicates 3/4 of the total unit, 1/4 black indicates 1/4 of the total unit, and an empty circle may be used for less than 1/4 of the unit amount. However, this practice tends toward greater precision than is necessary in a graph.
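The rule for whole and half dots amounts to rounding each state total to the nearest half unit, as the following sketch shows. The unit of 500 is that of Figure 36-B; the function name and the state totals are hypothetical, and the treatment of totals falling exactly on a boundary (125, 375, and so on) is an assumption, since the ranges given in the text overlap at those points.

```python
def large_dots(count, unit=500):
    """Return (whole_dots, half_dots) after rounding to the nearest half unit.

    Exact boundaries are rounded upward here; that choice is an assumption.
    """
    halves = int(count / (unit / 2) + 0.5)
    return halves // 2, halves % 2


# Hypothetical state totals of filling stations.
for total in (100, 200, 1_300, 1_400, 2_500):
    whole, half = large_dots(total)
    print(f"{total:>5} stations: {whole} whole dot(s), {half} half dot(s)")
```

A total of 1,400 is drawn as three solid dots because the remainder of 400 lies in the 375-625 range of the last solid dot, whereas 1,300 is drawn as two solid dots and a half dot.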
The use of large dots is similar to the method of equal-sized symbols illustrated in Figure 35-D, and practically the same effect would be secured if symbols of gasoline pumps were used in Figure 36-B instead of large dots.

Other forms of large-dot maps can be found in print, in which the large dots are not all uniform. For example, instead of 5 equal-sized dots to represent 2,500 filling stations, a single dot 5 times as large might be used. This method involves the difficulty of estimating the relative areas of circles. In other cases there may be an attempt to show two or more different sets of information on the same map by means of large dots that are equal-sized but in several colors or shadings. This method is likely to result in a confused picture of spatial relationships instead of one that stands out clearly. Any deviation, therefore, from the solid large dot as illustrated in Figure 36-B is usually unsatisfactory for the purpose of showing geographical distribution of quantities.

Ratio Maps. Although both small- and large-dot maps usually represent absolute quantities, there is an implied ratio even in these, because the quantities are distributed with relation to the area of each subdivision. This is particularly true of the small-dot or density map. The actual space covered by the area of each state is, in a sense, the denominator, and the number of point dots in each is the numerator. The resulting effect of density becomes a pictorial representation of the ratio, number (of some unit) to area. In Figure 36-A, for example, density increases either when the denominator (state area) is decreased or when the numerator (number of dots or filling stations) is increased.

However, the purpose of a statistical map is often to show ratios in which the denominators are some values other than areas; for example, retail sales data might be used to show percentage of change over the preceding year, or sales per capita.
Some pictorial device other than dots must therefore be used to summarize these ratios for spatial comparisons between the different sections of the map. The most usual method is by shading or cross-hatching, as illustrated in Figure 37.

Principles of cross-hatching: The types shown in the example are by no means the only possible kinds of cross-hatching, but they illustrate the principles involved. (1) The smallest ratio, i.e., the least density of occurrence of the characteristic, is best represented by white, and the largest by solid black, since these afford the greatest range of contrast. (2) The degrees of density should be in unmistakably ascending order from white to black. (3) This gradation is accomplished by increasing the heaviness of the lines, decreasing the space between them, or both; crossing them and filling in the intervening spaces are the next steps. Changing only the direction of the lines does not have any effect on the density of appearance. Alternate heavy black and white stripes may or may not appear darker than crossed or plaid effects and should therefore not be used in combination with them. The use of dots is undesirable for two reasons: it is often difficult to compare the density effect with that of lines, and there is danger of confusion with the point-dot type of map.

Interpretation of a cross-hatched map: In Figure 37, the same data that were used in Figure 36-B are expressed as ratios to the population of each state. 3 There is no unmistakable impression to be gained from a single glance at this map, except that in the states west of the Mississippi River the ratio of filling stations to population tends to be higher than in the east. It is highest in the central farm states, in Washington and Oregon, and in Florida. The last named is most easily explained, since the presence of many out-of-state cars produces a need for more filling stations than the native population would require. A study of the degree of car ownership and the miles of surfaced roads per capita in each state would aid in interpreting these ratios, but even a knowledge of these total state data is not enough. Neither will explain why such rural and relatively poor states as Mississippi, Alabama, Georgia, Tennessee, and Kentucky should fall in the same category as wealthy, industrial, and urban New York, Pennsylvania, and Massachusetts.

3 Note that the purpose of this map is to show the density of filling stations with relation to the population. In order to avoid fractions of filling stations, the denominator has been expressed as "per 10,000 persons" instead of "per capita." An alternative method for stating the ratios as whole numbers would be to invert them, using "number of persons per filling station." If this were done, 400 persons per station would mean greater density of filling stations than 1,000 persons per station. The lower ratio would then more properly be represented by dark shading and the higher one by light, resulting in the same total effect as in Figure 37.

[FIGURE 37: cross-hatched map of filling stations per 10,000 persons, by states. The map is not legible in this copy.]

It takes careful study of the situation to realize that there is one pertinent factor common to these two kinds of states: a large proportion of the population does not own cars, and most of the car owners drive only short distances daily. This is the reverse of conditions in the comfortably prosperous farm belt and also in the large western states where distances are great. In the largest cities, such as New York, Chicago, Philadelphia, Baltimore, and Boston, traffic congestion is so great that a small proportion of families owns cars in comparison with small town and rural families in the same income groups.
Thus the state ratio of filling stations to population will be reduced accordingly in New York, Illinois, Pennsylvania, Maryland, and Massachusetts. In the poorer regions of the south, comprising the greater part of the white and lightly hatched states, incomes are in general too small to lead to car ownership. North Carolina with its growing industries and improved roads is an exception in this area.

Because of these different factors that cause both numerators and denominators to fluctuate in the ratios depicted on this map, the picture presented is not as clear cut as can sometimes be achieved when the rates or other ratios pictured are definitely affected by a single condition of topography, climate, transportation facilities, concentration of population and industry, etc. In any case the planning of an effective cross-hatched map requires in a marked degree the co-ordination between artistic ability and statistical judgment which is stressed in the last part of chapter XIV. The rates, prices, per cents, or other ratios should be so grouped that if there are any significant spatial relationships they will stand out in the finished product. The choice of the right groupings is a very important factor in achieving this end.

Number and width of size groups: The first step in making a cross-hatched map is to work out the individual per cents or other ratios that are to be shown. These are next arrayed in order of size and studied for any logical grouping that may be seen. If no natural dividing points are obvious, the items should be grouped so that an approximately equal number will fall in each category. In Figure 37, the number of states in each group is as nearly equal as the data permit (10, 11, 12, 9, and 7), although the size groups are of uneven width (under 12, 12-17, 17-20, 20-22, and 22-27).
This is a departure from the rules that will be noted later in chapter XV for the presentation of frequency distribution tables and graphs.⁴ The number of groups has in this case been limited to five. More than six or seven kinds of cross-hatching are seldom effective on a map, and, in order to emphasize the contrasts, it may be desirable to have as few as three magnitude classes.

Special varieties of ratio maps: Certain conditions that are not confined within political boundaries may also be indicated by areas of cross-hatching. This kind of map is used chiefly to show belts of rainfall, crop conditions, etc., all of which are rates, per cents of normal, or other ratios grouped in class intervals. Non-statistical terms, as "good," "fair," and "poor" are sometimes used instead of a numerical measure for indicating crop, weather, or business conditions, but any such classes should be based upon numerical standards generally understood or defined in accompanying notes or text, and the rules stated above for ratio maps should govern the plan of shading.

Flow Maps. These maps use a device not previously named, that is, numerical values are represented by the width of lines instead of by their length. The direction of the lines adds a non-numerical spatial relationship. This method has been found valuable chiefly in studies of traffic density. The same idea is followed in connecting areas of supply with areas of distribution. This is illustrated in Figure 38 showing the flow of exports from the United States to Canada and to other continents. Each line represents 5 per cent of exports, and the width of the combined lines indicates the proportion going to each area. An alternative method utilizes circles of varying sizes to express the quantities at the point of supply, with arrows indicating the direction of distribution. Either method may be employed in simple diagrammatic form instead of on a map as background.
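The grouping rule given for cross-hatched maps (array the ratios in order of size, then form groups holding approximately equal numbers of items) may be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input values are invented, and the routine makes no attempt to look for "natural dividing points" or to keep tied values together, both of which call for the judgment described in the text.

```python
def equal_count_groups(values, k):
    """Array the values in order of size and cut them into k groups
    whose item counts differ by at most one."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    groups, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` groups get one more item
        groups.append(ordered[start:start + size])
        start += size
    return groups


def class_limits(groups):
    """Lowest and highest value falling in each group."""
    return [(g[0], g[-1]) for g in groups if g]


# Hypothetical ratios for eleven areas, grouped into three shading classes.
ratios = [14.2, 9.8, 21.5, 17.0, 25.1, 11.3, 19.4, 22.8, 12.6, 18.1, 20.3]
groups = equal_count_groups(ratios, 3)
print([len(g) for g in groups])
print(class_limits(groups))
```

The class limits that result are of uneven width, as in Figure 37; the limits would ordinarily be rounded to convenient figures before the key to the map is drawn.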
⁴ The entirely different graphic presentation of similar data in a frequency diagram requires class intervals of equal widths, whereas the number of items varies from class to class. See Bruce D. Mudgett, Statistical Tables and Graphs (Boston: 1930), Houghton Mifflin Co., pp. 179, 187-89.

FIGURE 38
FLOW MAP: UNITED STATES EXPORTS, 1931
[Figure notes: One half of our exports goes to Europe. 1% to Australia included in Africa. Each line equals 5% of total value of exports from the United States in 1931.]

Circle Graphs

For certain purposes a circle graph is superior to bars or other linear measures. However, by its very nature it is adapted to but few usages.

Parts of a Total. The circle is a classic symbol of unity; hence it is ideal for representing total values. When divided into sectors, as in Figure 35-B, the same visual effect is given whether these values are expressed in per cents or as actual amounts. The actual total of cash farm income was $8,000,000,000; the part contributed by crops was nearly $3,200,000,000 or 40 per cent of the total, but whichever way it is designated its sector will cover about 144 degrees, or 2/5 of the circle. Furthermore, since the angle measured by a given number of degrees is the same regardless of the size of the circle, it is possible to use two or more circles, with areas⁵ proportional to the totals they represent, to compare the absolute amounts of several totals, while the number of degrees in the corresponding sectors remains comparable. This advantage is one not possessed either by linear bars or rectangular areas in representing parts of several totals. However, there is a certain tendency for the eye to measure the entire area of a sector instead of its angle. Therefore, if the emphasis is on comparison of the proportions of corresponding parts rather than on difference in total magnitudes, it is probably better to use equal-sized circles.

⁵ The areas of circles are in proportion to the squares of their radii.
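The arithmetic behind a circle graph can be sketched briefly. The first function turns a part and its total into degrees of arc; the second finds the radius of a circle whose area is to be proportional to a total, using the relation that areas vary as the squares of the radii. The function names and the reference radius are assumptions made for illustration only; the farm income figures are those quoted in the text.

```python
import math


def sector_degrees(part, total):
    """Degrees of arc for a sector representing `part` out of `total`."""
    return 360.0 * part / total


def radius_for_total(total, reference_total, reference_radius=1.0):
    """Radius of a circle whose AREA is proportional to `total`,
    given that a circle of `reference_radius` represents `reference_total`.
    Since area varies as the square of the radius, the radius varies as
    the square root of the total."""
    return reference_radius * math.sqrt(total / reference_total)


# Crops: about $3,200,000,000 out of a total cash farm income of $8,000,000,000.
print(sector_degrees(3_200_000_000, 8_000_000_000))

# A total four times as large calls for a radius only twice as large.
print(radius_for_total(4.0, 1.0))
```

Doubling the radius for a total twice as large would quadruple the area and so exaggerate the comparison, which is the error the square-root step is meant to avoid.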
If the graph includes two or more circles, there should be a starting-point common to all of them, usually the radius extending upward from the center, although other quarter positions are also permissible. In each circle the various sectors should follow the same order, clockwise, around to the starting-point. The cross-hatching of sectors usually indicates different attributes or geographical divisions rather than the quantitative differences shown on cross-hatched maps; hence an ascending scale of density is not necessary. A better contrast is achieved if dark sectors alternate with light ones, as in Figure 35-B. The kinds of cross-hatching to be used therefore need not be definitely distinguished in degrees of density so long as no two sectors look too much alike. Dots are permissible and concentric curved lines are also used. Diagonal lines, however, should follow the same direction on the page in each circle, regardless of the direction of the radii in the sector. A key to the cross-hatching may be used instead of printing the legends in each sector, particularly when there are two or more circles.

Dial Indexes. During the depression decade when practically every measure of business conditions was below normal, it became customary to use circle diagrams to show percentage of normal, the entire circle representing 100 per cent. These graphs were easy to understand but basically incorrect. In this case 100 per cent is not a total or maximum value but merely indicates a normal or average condition; it is possible for any index to rise above 100, whereas a circle can never represent more than 100 per cent. After some attempts to show an extra "piece of pie" on top of the whole, or bulging out at one side, the dial form of index was developed. In these, each circle is graduated up to 100, with 100 at the top, but the scale continues around to 120 or 150 if necessary, these values standing in place of 20 and 50.
Several pointers from the center mark the value for this month, last month, last year, etc. This method of showing index values at selected periods is quite effective, easy to read and affords correct comparisons. Another variation, instead of reproducing the entire dial, shows only a section of it, as illustrated in Figure 39. This makes it possible to use a much larger scale dial without requiring the space for a large complete circle.

FIGURE 39
DIAL CHART: INDEX OF INDUSTRIAL ACTIVITY AS OF MAY 31, 1941
[Figure notes: Associated Press index of industrial activity, with values for this week and last week shown for each component: auto production, steel output, cotton manufacturing, electric power, residential building, railway freight. Reproduced from Buffalo Evening News.]

Linear Graphs

Practically any numerical value can be represented by the length of a line, and consequently bars and other adaptations of linear graphs are more widely used than any other form of statistical graph. The more complicated forms, involving time and quantitative relationships, will be described in the next two chapters. This section deals only with the simpler types in which there is but one scale, which may extend either vertically or horizontally.

Pictograms. As has already been suggested, a pictogram of the kind shown in Figure 35-D is the simplest form of linear diagram. It dates from prehistoric times, when primitive peoples drew five canoes or ten moons to represent such concepts as quantities or the passage of time. Where standardized symbols of very simple form, all the same size and evenly spaced, are shown in rows, each symbol represents a certain number of original units and several groups of them can be compared according to the lengths of the rows. The example in Figure 35-D appears to have no scale, but actually there is one, horizontally.
The advantage of this graphic device as a means of distinguishing attribute classifications has already been noted. When the method is carried to the extreme of using several variations of the symbol in each row and other symbols at the ends of the rows to indicate the different attribute classifications, as illustrated in Figure 40, the graphic requirements of clarity, simplicity, and effectiveness are likely to become lost in confusion.

Many pictogram forms that enliven the reports of governmental or other agencies do not fall within the scope of a discussion of statistical graphs; they may succeed in attracting attention but they do not convey numerical ideas. If they do aim to present statistical material the only acceptable method is the linear form, made up of identical symbols. No statistical misconception can result from such graphs, as does occur through the use of symbols or pictures of different size or those not identical in form. The rules governing the use of pictograms can be summed up as follows:

1. Symbols should be self-explanatory.
2. Larger quantities are shown by a larger quantity of symbols, not by larger symbols.
3. Charts compare approximate quantities, not minute details.
4. Only comparisons should be charted, not isolated statements.⁶

FIGURE 40
PICTOGRAM: NUMBER OF WORKERS IN BASIC FIELDS OF EMPLOYMENT, 1940
[Figure notes: "What kind of work do they do?" Rows of symbols for manufacturing and mining; agriculture; wholesale and retail trade; governments, civil and military; others (transportation, finance and service, other). Each symbol represents one million workers. Prepared by Pictograph Corporation. Reproduced from New York Times Sunday Magazine, April 13, 1941.]

Bar Graphs. In Figure 35-A plain continuous bars serve the same purpose as rows of symbols. However, to most readers the bars are just as easy to understand if not more so.

Single bars: Single bars are often used to depict the separate classes of an attribute classification.
They are also suitable for showing geographical classifications in which the spatial relationship of one class to another is not so important that a map is required; the three cities in Figure 35-A are an example. Such bars may also indicate values at selected periods of time that do not constitute a continuous time series. Each bar always represents a number or value for a single attribute or place or period. The bars are arranged along a base line which has no scale but which is labeled just like the stub of a table. The same principles as in tabulation determine the order of arrangement of the bars. That is, they may be in ascending or descending order of size, alphabetical, or in any other logical order.

Groups of bars: As in a table, bars are often arranged in sub-classifications grouped according to whatever emphasis is desired. Solid black may be used for all the bars, whether single or in groups. When there are several subgroups, each containing the same classes of items, cross-hatching is frequently used to identify corresponding bars in each group. As in the case of circles and sectors, this hatching is merely a means of distinguishing each attribute from the others, and no order of density is prescribed except that the same order should be followed in all the groups. Figure 41-A uses this method to compare the number of wage earners in three leading food industries, at three census periods.

Divided bars: In Figure 41-B⁷ each bar is subdivided into a number of segments. This type of bar graph is used for the same purpose as the circle and sectors, to indicate parts of a whole, and the rules for cross-hatching the parts are the same as for the sectors of a circle. In planning the graph, it must first be determined which is more important, to present an accurate picture of the total values, or to afford a correct cross-comparison of the proportionate distribution of parts. If the former, the original units must be shown on the scale and the bars will be of varying total lengths. If the latter, as in the illustration, the scale will be in per cents and, if all the parts are included in the graph, all bars will have the same total length, 100 per cent.

When part-to-part comparisons are wanted in a single distribution the total bar may be cut into parts, each one starting from the zero base, instead of being arranged consecutively in one divided bar. This method is illustrated in Figure 41-C, which is a rearrangement of one of the divided bars in Figure 41-B. When the parts of only one total are shown, as in this case, the graph would present a better appearance if all the bars were solid black. However, if all of the bars of B were reproduced in the form of C, cross-hatching would be necessary and the entire graph would resemble the groups of bars in Figure 41-A.

⁶ Rudolf Modley, How To Use Pictorial Statistics. New York: Harper & Bros., 1937.

⁷ Reproduced from Survey of Current Business (December, 1937), p. 13.

FIGURE 41
TYPES OF BAR GRAPHS
[Figure notes: A. Number of wage earners in three leading food industries (bakery products, meat packing, canned fruits and vegetables), U.S. Census of Manufactures, 1914, 1929, 1937; scale in thousands of wage earners. B. Occupational distribution of nonrelief families sampled by Departments of Agriculture and Labor, 1936, for 1 metropolis (Chicago), 3 large cities, 9 middle size cities, 25 small cities, and 107 villages; classes are wage earners, clerical, business and professional. C. Per cent of nonrelief families in large cities in each occupational group, 1936. D. Per cent of families above and below $750 annual income, rural and urban, Negro families and white families.]

Duo-directional bars: In this type of linear graph, the single scale extends in both directions from zero.
It has two main uses: (1) to show percentage change among a set of comparable items some of which may have increased while others decreased, and (2) to compare values classified according to two contrasting attributes, such as male and female, Republican and Democrat. The example in Figure 41-D⁸ is a modification of this second usage and also of the types shown in 41-A, B, and C. The contrasting attributes are incomes above or below a certain standard, $750 per year. There are two groups of bars, urban and rural, each group containing a bar for white and a bar for Negro families. Each of these four bars is actually a divided bar, that is, its total length represents 100 per cent, divided into the percentage of families having an income above $750 per year and the percentage below $750. The bars are aligned at this dividing point as zero instead of showing their total length measured from a common base. The method is very effective in emphasizing the particular comparison that is wanted in this case. Figure 76, page 548, is an example of a duo-directional bar diagram showing increases and decreases.

⁸ The original form of this graph was a pictogram, in the Consumers' Guide, United States Department of Agriculture, September, 1938.

Essential features of bar diagrams: Unbroken Scale: A bar graph gives an accurate impression only when the relative lengths of the bars are correctly represented. It follows that the scale in every case must start at zero and continue as high as the highest value to be shown on the graph. If it starts at some point beyond zero, or if any break is made in the scale, all of the bars become shortened by equal instead of proportional amounts, and consequently a misleading picture is given of the relative total lengths of the various bars. Some statisticians consider that any labeling or figures at the far end of each bar, or on the bars, also interferes with the estimate of their lengths.
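The effect of starting the scale beyond zero can be shown with a small calculation. The sketch below uses two hypothetical values, 100 and 120, chosen only for illustration; the function names are likewise assumptions. Cutting 80 units from the base of the scale removes the same amount from both bars, so the ratio of their drawn lengths no longer agrees with the ratio of the values.

```python
def drawn_lengths(values, scale_start=0):
    """Length of each bar as drawn when the scale begins at `scale_start`."""
    return [v - scale_start for v in values]


def length_ratio(lengths):
    """Ratio of the longest drawn bar to the shortest."""
    return max(lengths) / min(lengths)


values = [100, 120]  # hypothetical data

full = drawn_lengths(values)                  # scale starting at zero
cut = drawn_lengths(values, scale_start=80)   # scale starting at 80

print(full, length_ratio(full))  # the larger bar is 1.2 times the smaller
print(cut, length_ratio(cut))    # the larger bar now appears twice the smaller
```

A difference of one-fifth in the data thus appears on the broken scale as a difference of one hundred per cent.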
However, if the figures are inserted near the zero end of the bars, there can be little objection. In the case of horizontal bars, such inserted figures appear as a column, taking the place of an accompanying table.

Shading, and Spaces between Bars: Bars are most effective when solid black. White narrowly outlined in black is least effective since, unlike sections of a map or circle, every bar is entirely surrounded by a white background. Adjacent bars must be separated by white spaces, usually slightly narrower than the bars themselves. If they were not separated an effect of area rather than length would result, and it would be hard to estimate correctly the lengths of individual bars. Somewhat wider spaces are used to separate groups of bars, or each group may be boxed separately in a complete border, as in Figure 41-A.

Scale and Labels: In the simple types of bars thus far described there is only one scale but this scale should always be shown. There is no fixed rule as to whether the bars should extend horizontally or vertically. If they are vertical, the scale is often repeated on the right side as well as on the left, and if horizontal it may be either at the top or bottom, or both. The horizontal position is usually more convenient for graphs containing only a few bars. It is possible then to print the label of each bar at the left of the vertical base line, with the numerical value also if desired, and sufficient space can be allowed in a natural horizontal position for all labels, figures, and scale values. However, when the bars represent a time classification, even though not constituting a time series, it is customary to draw them in a vertical position, as illustrated in Figure 41-A, to correspond with the accepted form for time series bars, which will be explained in the next chapter.

INTRODUCTION TO TWO-DIMENSIONAL LINEAR GRAPHS

Every graph drawn on a plane surface has, of course, two dimensions.
In this text, however, the simple types of linear graphs that have been discussed in the preceding section are distinguished from the more complex types in which there are two scales of values, one extending horizontally and the other vertically.

Definitions

The term "two-dimensional"⁹ will be applied to the latter type of graphs. Whenever the data consist of series of two or more variables, a two-dimensional linear graph must be employed. Before describing the principles and methods of constructing this kind of graph, some definition of terms becomes necessary.

Variables. This word has occasionally been used in preceding chapters, without definition. According to Day,¹⁰ "A variable is anything which exhibits differences of magnitude or number." It is used to refer to any column or row of data indicating changes in number or value of the particular unit named in the heading, e.g., in Table 15, page 149, the price of wheat is a variable. An ordered classification is also considered a variable. That is, classifications by size groups, time classifications according to regular periods or intervals, and even qualitative attributes that can be arranged in numerical order, such as age groups, are all variables. In Part A of the wheat-price table, the stub classification "time in weeks" is therefore a variable; the classifications in Parts B and C are not variables, however, since neither "grades of wheat" nor "market" depend upon differences in magnitude or number.

Statistical Series; Dependent and Independent Variables. When two such variables are shown in relation to one another the resulting table of data is a statistical series. Thus there may be series in time or series according to attribute.¹¹ Variables in a series are further defined as "dependent" and "independent," the unit usually being the dependent variable and the classification the independent variable; e.g., the price of wheat is the dependent variable and time in weeks is the independent variable. Under some conditions both variables could be considered independent, in which case either one could be classified in terms of the other. For example, instead of quoting prices according to the time classification, weeks, a frequency distribution of the same data might use price range as the stub classification, the unit being the number of weeks in which each price was quoted:

    PRICE RANGE      NUMBER OF WEEKS
    $.60-$.649              3
     .65- .699             10
     .70- .749             22
     etc.

⁹ Not to be confused with "duo-directional" in which a single scale extends both positively and negatively from zero; nor with "double or multiple scale," a term to be introduced later referring to two or more scales of units both measured vertically.

¹⁰ Edmund E. Day, Statistical Analysis (New York, 1925): The Macmillan Co., p. 10.

¹¹ Series in space are also suggested by Day (ibid., p. 45), but this possibility seems not wholly consistent with his definition of a variable, since no geographic classification "exhibits differences in magnitude or number."

In the discussion of classification in chapter VIII it was stated that from the point of view of tabular arrangement no distinction need be made between series and other kinds of classification. In graphic development, however, the treatment for a series of two variables, whose relations to one another are of primary importance, is quite different from the methods that have been described in this chapter for a single variable classified non-numerically.

Two-Dimensional Scales. In order to measure graphically the values of the two variables in a series, two scales are required on two axes that intersect at right angles.
The vertical axis corresponds to the single scale described in the preceding section on linear graphs and is used to measure the number of units in the dependent variable, that is, the numerical values of the data themselves. In other words, it represents the figures that appear in the rows and columns of a table. The horizontal scale in similar fashion measures the independent variable, that is, the numerical values of the classification.¹² Ordinarily such a graph uses only one quarter of the field covered by the intersecting axes, that is, the "positive" field above and to the right of the point of intersection. If negative values are necessary on the vertical scale, the field below the intersection must also be shown, and similarly the left-hand field for negative values on the horizontal scale.

For use in formulas and diagrams, the dependent variable is referred to as Y and the independent variable as X. The Y variable is called a function of the X variable. The student will do well to familiarize himself with all these uses of the term "variable" because they become increasingly important in advanced statistical analysis.

¹² Certain exceptions to this rule will be found in some special purpose graphs, such as price curves in the field of economics.

Types of Statistical Series and Their Graphs

When the classification indicates quantitative attributes, the units in the dependent variable are called "frequencies." Graphic methods for illustrating analysis of this kind of series will be described in chapter XV. Series in which two quantitative attribute classifications are related to one another lead to analysis by correlation, the subject of chapter XXVII. The principles of time series and the methods for representing them graphically are discussed in the first half of chapter XIV.

In the majority of texts it is customary to reserve the use of the word "series" for time series, and to refer to other kinds of series as "groups."
Consequently that practice will be adhered to in subsequent chapters of this book. The general term "series" was introduced in this section to point out a basic similarity in all data involving functional relationships between two variables. A two-dimensional scale is necessary for graphic representation of any such relationships.

PROBLEMS

1. What is the principal advantage of a graph as contrasted with a table for presenting information?

2. State briefly the relations shown in each part of Figure 35, page 299.

3. Find in print a graph of one of the types presented in Figure 35. Analyze on the basis of our text the steps followed by the author of the graph in its preparation.

4. Select from any issue of the Statistical Abstract tables which should be represented by each of the four types of statistical maps described in the text. Explain why each set of data could be represented best by the type of map you have selected.

5. a) Present the information given below in the form of circles and sectors.
   b) Discuss the difference in use of fuels by iron and steel, and by all industries.

COST OF INDUSTRIAL CONSUMPTION OF PURCHASED ELECTRICITY AND OTHER TYPES OF FUEL, FOR ALL INDUSTRIES, AND FOR IRON AND STEEL, 1929*

                               COST (MILLIONS OF DOLLARS)
    TYPE OF FUEL               ALL INDUSTRIES   IRON AND STEEL
    Total                          1,973.9          463.1
    Purchased electricity            719.5          128.2
    Bituminous coal                  754.5           87.2
    Anthracite coal                   43.6            2.5
    Coke                             243.7          198.2
    Fuel oils                        212.6           47.0

    * 1930 Census of Manufactures, Vol. 1, p. 161.

6. Given the following information concerning the age distribution of all persons over 10 years of age in the United States and those gainfully employed in each group:

                                                         MILLIONS
                                                    1900    1920    1930
    Total population 10 years of age and over       57.9    82.7    98.7
      Total gainfully employed                      29.1    41.6    48.8
    Total population 10-15 years of age              9.6    12.5    14.3
      Total gainfully employed                       1.7     1.1      .7
    Total population 16-44 years of age             34.7    48.1    56.3
      Total gainfully employed                      20.2    28.9    33.5
    Total population 45-64 years of age             10.4    17.0    21.4
      Total gainfully employed                       5.8     9.9    12.4
    Total population 65 years of age and over        3.1     4.9     6.6
      Total gainfully employed                       1.2     1.7     2.2

   a) Study the changes in the age composition of the employed population during the 30-year period, and draw a graph to illustrate your conclusions.
   b) Study the changes in the percentage of each group gainfully occupied, and draw a graph to illustrate these changes.
   c) Write an interpretation of the data illustrated by your graphs.

7. a) Present the following information graphically.
   b) Discuss the nature of changes in this business during the ten-year period.

SALES IN A COUNTRY GENERAL STORE ACCORDING TO TYPE OF GOODS, 1930 AND 1940

    TYPE OF GOODS          1930       1940
    Groceries            $19,650    $21,410
    Meats                    400        975
    Shoes                  1,125        630
    Rubber footwear          760        925
    Dry goods              1,650        310
    Notions                1,850      1,025
    Hardware                 925      1,070
    Drugs                    425        115

REFERENCES
(See page 348)

CHAPTER XIV

GRAPHS (Continued)

TIME SERIES GRAPHS

The passage of time is most naturally pictured by a moving point whose apparent course may be traced from left to right. This basic idea is utilized by all types of time series graphs.

Bar Graphs

A time series may be represented by a row of bars, the height of each bar representing a single value, exactly as in the one-dimensional graphs. However, the order of arrangement no longer depends on judgment or arbitrary choice; the bars must stand at evenly spaced intervals along the base scale, the units of which represent successive equal periods of time. The fluctuations of the dependent variable may be followed through the path marked by the tops of the bars. This is illustrated in Figure 42, which shows the changes in value of United States exports annually, 1922-39.
FIGURE 42
BAR GRAPH OF TIME SERIES
VALUE OF UNITED STATES EXPORTS, 1922-39
[Figure notes: Vertical scale in billions of dollars; horizontal scale in years, 1922 to 1939. Data from Statistical Abstract.]

Band, Strata, or Surface Graphs

Bars that represent a time series may be divided into several components. A long row of divided bars gives an effect of wavy horizontal bands or strata. To serve the same purpose, these several component parts of the variable are frequently shown as a continuous "surface" instead of in separate bars. The strata or bands in the surfaces are cross-hatched the same as divided bars, and show by contrast the changing proportions of the parts of a total over a continuous period of time.

This type of chart may be designed to show percentage distributions, in which case the graph consists of a rectangle completely filled in with bands of fluctuating width. If the scale represents actual values instead of per cents, the upper boundary of the surface will be irregular, representing the actual total at each period. The two types are illustrated in Figure 43-A and 43-B, both of which show the same total data as Figure 42, divided into component parts.

Whether per cents or actual values are shown, there is some danger that the bands may take on a distorted appearance due to sudden extreme fluctuations in some of the parts. The width of the band must be estimated by the vertical distance between its boundaries at each point. In the illustration at the right the width is actually the same throughout the period, but due to various angles of change in its lower boundary, it is pulled out of shape. If the data do not contain too many sudden changes, this distortion may be reduced to a minimum by charting at the bottom of the graph the narrowest band and others having but slight fluctuation, so that succeeding bands will have a lower boundary that is fairly level.
The upper one, that which fluctuates the most, will then not affect the shapes of the other layers. The smoother strata can be located near the top and bottom in a 100 per cent graph. The jagged edges of the two most irregular strata will then fit together near the center, their widths being measured from the straighter edge of each. In some cases, however, a required sequence of the component parts based on other considerations of the data will determine the order of the bands.

[FIGURE 43. BAND GRAPHS OF TIME SERIES. VALUE OF UNITED STATES EXPORTS ACCORDING TO MAJOR ECONOMIC CLASSES, 1921-39. A. Cumulative per cents (vertical scale: per cent). B. Cumulative dollar values (vertical scale: billions of dollars). Data from Statistical Abstract.]

A well-planned band, strata, or surface chart is a valuable means for showing the changing relations of the component parts to the whole in a time series. On the other hand, if the primary purpose is to place emphasis on time comparisons between individual parts, or between separate totals, a line graph is preferable.

Line Graphs of Time Series

Graphs of this type are unquestionably more widely used than any other in every phase of government and business statistics. They are constructed in exactly the same way as simple bar graphs of time series, except that, instead of drawing a bar for each time period, only the point at the upper end of each bar is plotted.¹ The successive points are then connected by straight lines whose combined length may have a more or less jagged appearance, depending on the irregularity of the data. Regardless of its degree of smoothness, this continuous line is called a "curve." The chief advantage in using curves instead of bars is that several curves may be shown for comparison on the same graph more easily than several sets of bars.
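The construction of the band graphs described earlier in this section can be sketched in a few lines. This is only an illustration: the export figures are hypothetical, the function names are invented for the example, and ordering the bands by their range (highest value minus lowest) is one assumed way of putting the steadiest band at the bottom.

```python
def to_percent(components):
    """Convert each period's component values to per cents of that period's total."""
    totals = [sum(period) for period in zip(*components)]
    return [[100.0 * v / t for v, t in zip(series, totals)] for series in components]


def order_by_stability(components):
    """Put the steadiest series (smallest range) first, that is, at the bottom."""
    return sorted(components, key=lambda series: max(series) - min(series))


def band_boundaries(components):
    """Return the upper boundary of each band, bottom band first.

    The width of a band at any period is the vertical distance between
    its own boundary and the boundary of the band beneath it.
    """
    running = [0.0] * len(components[0])
    boundaries = []
    for series in components:
        running = [r + v for r, v in zip(running, series)]
        boundaries.append(running)
    return boundaries


# Hypothetical export values (billions of dollars) for three classes, four years.
crude = [1.2, 1.0, 0.6, 0.9]
foodstuffs = [0.5, 0.5, 0.4, 0.5]
manufactures = [2.0, 2.6, 1.1, 1.9]

ordered = order_by_stability([crude, foodstuffs, manufactures])
dollar_bands = band_boundaries(ordered)               # irregular upper boundary
percent_bands = band_boundaries(to_percent(ordered))  # upper boundary is always 100
```

In the dollar version the top boundary traces the actual total for each year; in the per cent version it is the level line at 100, so the rectangle is completely filled.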
Curves of time series may serve either of two purposes: to show actual amounts of change or to show relative changes. Whichever function is more important will determine whether the vertical scale should be arithmetic or logarithmic.

Arithmetic Scale. An arithmetic scale in which equal spaces on the scale stand for equal amounts of the unit is familiar to everyone. It should be used for the vertical scale whenever a comparison is wanted between actual amounts of the unit either for a single series at different time periods or for several series at corresponding periods.

Methods of comparing several series: A problem arises, however, in comparing two or more series that are recorded either in different units, or in the same unit at levels so far apart that it is difficult to use the same scale effectively for both. The purpose of such a graph must be carefully considered before choosing one of the alternative graphic methods for dealing with this situation. Figure 44, A, B, and C, shows three ways of handling the same data on arithmetic scales.

¹ In mathematical language, the positions assigned to values of the independent variable along the horizontal axis are called "abscissas" and the values of the dependent variable assigned along the vertical axis are called "ordinates." The plotting of any value of the dependent variable consists in (1) determining the position of the independent variable on the horizontal scale (abscissa) and the value of the data on the vertical scale (ordinate); (2) locating the point of intersection of a vertical at the abscissa position and a horizontal at the ordinate value.

1. Single Unit Scale: If the primary purpose is to compare absolute amounts at each period, a single² unbroken arithmetic scale is best even though it minimizes the fluctuations of the series having the smaller values.
In this case, if the purpose is to show that coal is still the major source of power, as compared with oil, Figure 44-A is the one to use.

2. Index Numbers: If a comparison of relative changes of the variables will serve the purpose, and particularly when two or more series do not have a common unit, they may be reduced to indexes, using a corresponding base³ period in each case. The 100 per cent line and the entire per cent scale will be common to the several series. Figure 44-C shows the relatively greater percentage variation of oil as a source of power, 1906-10 being chosen as the base period.

3. Scale Equation: If it is essential to depict on the graph the actual rather than the relative values, and at the same time to show on equal terms the degree of fluctuation in each series, some form of scale equation must be utilized. Several methods are commonly used but the only one that is justified statistically⁴ is the equation of the several series of values so that their respective averages for the period will approximately coincide on the graph. The method is as follows:

(a) Find the arithmetic average of each series, in its own unit, for the entire period.
(b) Find the ratio between the two averages. In this example, the averages were 8.7 quadrillion B.T.U. for coal consumption and 1.62 quadrillion B.T.U. for oil, or roughly one unit for oil to five for coal.
(c) Separate vertical scales are drawn on either side⁵ of the graph, both starting at zero. In Figure 44-B, coal is on the left and oil on the right.
(d) The same space that represents five units on the left (coal) represents one unit on the right (oil), and the approximate averages of the two coincide.
(e) In arranging the equated scales, they should not be condensed so much that fluctuations are underemphasized, but the maximum values of both sets of data must be provided for.

² This method, of course, is available only when the two series are in the same unit or can be reduced to a common unit. Otherwise the comparison of absolute amounts on a single scale is out of the question.

³ For the choice of base, and interpretation of the comparison, refer to the discussion of index numbers in chapter XIX, pages 483-86 and page 498.

⁴ The suggestion has been made by some statisticians that the scales be equated on the basis of dispersion. However, the entire purpose of scale equation is to compare the degree of fluctuation between two series of data, and if this dispersion is equalized there appears to be no useful comparison shown by the graph.

⁵ If three series must be equated it is necessary to have three separate scales, each properly labeled. Customarily two of the scales are placed at the left and one at the right, but occasionally all three will be found at the left. With more than three series the graph becomes too involved to read easily.

[FIGURE 44. LINE GRAPHS OF TIME SERIES. SUPPLY OF POWER FROM COAL AND DOMESTIC OIL, ANNUAL AVERAGES FOR FIVE-YEAR PERIODS, 1871-1935. (All figures represent equivalent British Thermal Units, in quadrillions.) A. Absolute amounts, arithmetic scale. B. Two arithmetic scales equated to annual averages of 13 five-year periods. C. Index numbers on 1906-10 base. D. Semi-logarithmic chart. Data from Statistical Abstract.]

As a result of this method, each variable is given equal emphasis in terms of fluctuations from its own average, and they may be compared accordingly.
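The steps of the scale equation method can be sketched as follows. The two averages (8.7 and 1.62) are the ones quoted in the text; the short coal and oil series and all function names are hypothetical, chosen only to show that heights on an exactly equated scale are proportional to index numbers computed on the average of the period.

```python
def mean(values):
    """Arithmetic average of a series."""
    return sum(values) / len(values)


def scale_ratio(left_series, right_series):
    """Units on the left scale that occupy the same space as one unit on the right."""
    return mean(left_series) / mean(right_series)


def height_on_left_scale(right_value, ratio):
    """Height, read in left-scale units, at which a right-scale value is plotted."""
    return right_value * ratio


def index_numbers(series):
    """Index numbers using the average of the whole period as the base (= 100)."""
    base = mean(series)
    return [100.0 * v / base for v in series]


# Averages quoted in the text, in quadrillion B.T.U.
exact_ratio = 8.7 / 1.62            # about 5.37 units of coal to one of oil
working_ratio = round(exact_ratio)  # the convenient round figure: 5

# Hypothetical series, used only to illustrate the equivalence with index numbers.
coal = [6.0, 9.0, 10.0, 9.0]
oil = [0.5, 1.0, 2.0, 3.0]
ratio = scale_ratio(coal, oil)
oil_heights = [height_on_left_scale(v, ratio) for v in oil]
oil_index = index_numbers(oil)
```

With the rounded ratio of 5, the oil average of 1.62 falls at 8.1 on the coal scale, close to the coal average of 8.7, which is why the averages are said to coincide only approximately.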
For example, Figure 44-B shows that after 1916-20 oil-produced power expanded sharply above its average level, while in the same period coal-produced power fluctuated mildly with a slight tendency to decline toward its average level. The importance of the increase in oil-produced power during this period is completely concealed in Figure 44-A. Specifically the increases in coal-produced and oil-produced power appear to be equally important in 44-A but 44-B shows the actual relation in the growth of use of the two fuels.⁶

It should be noted that if index numbers are computed on an average of the period as a base, the shapes and relative locations of their curves will be exactly the same as by this more cumbersome method of scale equation. The only advantage of scale equation over index numbers is that actual amounts instead of percentages can be read from the graph.⁷

4. Need for Logarithmic Scale: A comparison of relative rates of change, or relative increases or decreases from one period to another, is likely to be more important than any of the three purposes named above. Such comparisons cannot be shown satisfactorily on an arithmetic scale, even by means of index numbers. Changes in the latter must always be studied with reference to a certain base period and cannot be shown equally between any two periods or at various levels on the scale. This is because equal amounts, or spaces, on an arithmetic scale represent constantly decreasing percentage changes as the values of the scale increase. For example, in Figure 44-C the index of coal consumption increased from 14 to 24 from 1876-80 to 1881-85, a relative increase of over 70 per cent, but the index rose only ten spaces on the scale; on the other hand, from 1921-25 to 1926-30 the oil consumption index rose from 375 to 518, an almost identical percentage increase (74 per cent), but 143 spaces on the scale were required to show it.
In order to give an accurate visual conception of these relative rates of change, the arithmetic scale must be abandoned in favor of the logarithmic. This fourth method is shown for the oil-coal data in Figure 44-D and will be explained later.

⁶ Students will be able to make this type of interpretation in greater detail after studying measures of dispersion in chapter XVIII.

⁷ Note Figure 78, page 555, in which two scales are equated to the values at the first period instead of to the respective averages. In this case the purpose of the graph is to show how the two series diverge.

Breaks in time series scales: Vertical Scale: Strict accuracy requires that there should be no break in the arithmetic scale of a time series line graph any more than in the case of bars. However, it is often just as important to study the fluctuations in the variables as to compare the actual total values between the points of the curves. The lowest value of any of the variables may amount to several million units, and a scale covering a complete range between zero and the highest value to be graphed would become so small that a change of even several thousand units would cause no perceptible movement in the curve. Common practice, therefore, permits a "break" in the vertical scale below the lowest point needed for any of the points plotted. Zero is indicated as the base, then a double jagged line is drawn (using finer lines than the curves on the graph) and the scale may be resumed above the break at any value required. The break represents an actual tear in the paper, hence the vertical scale line and all grid lines are left blank between its boundaries. (See Figures 73-A, 74 and 75, pages 540, 543, and 546, respectively.) Likewise when the values of the series are in the form of index numbers, if the vertical scale is incomplete, a more careful study becomes necessary in order to estimate correctly the percentage of fluctuation.
However, in order to enlarge the per cent scale, zero is frequently omitted altogether, the scale being extended below 100 only as far as the data require. In any case, the 100 per cent or normal line should be emphasized, since it is just as important a standard as zero.

Horizontal Scale: So much emphasis has been placed on the fact that the regular intervals of a time series graph accurately depict the even progress of time that any suggestion of tampering with this scale may well be questioned. There is no abrogation of the principle that equal spaces on any arithmetic scale should always represent equal values. However, just as under some circumstances a break may be permitted in the vertical scale, there are also situations which justify changes rather than breaks in the time scale.

In many business charts the main interest is in the presentation of current monthly or weekly data. Comparison with the past is a matter of only secondary importance, and it has therefore become customary to enlarge the scale for the current year and to contract the scale for previous years. If space is limited, and it is desirable to show some information regarding changes in the past for as long a period as possible, there are several methods for representing the earlier years in condensed form. Some possible alternatives are: (1) to show on a small scale monthly changes for several previous years; (2) to represent only the annual averages by a single point in each year of the earlier period; (3) to use a series of vertical bars each of which indicates the range of fluctuation within a single year. Any of these methods gives a more complete and continuous story than can be gained from a chart in which there are either no data at all regarding previous periods, or else there is a complete break of several years with no indication of what happened in between.
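Alternatives (2) and (3) above amount to reducing each earlier year to three figures: its average, its lowest value, and its highest value. A minimal sketch, assuming the earlier data are held as twelve monthly values per year; the index values and the function name are hypothetical.

```python
def annual_summary(monthly_by_year):
    """Condense monthly data to one average and one range per year.

    The average supplies the single point of alternative (2); the low and
    high values supply the ends of the vertical bar of alternative (3).
    """
    summary = {}
    for year, values in sorted(monthly_by_year.items()):
        summary[year] = {
            "low": min(values),
            "high": max(values),
            "average": sum(values) / len(values),
        }
    return summary


# Hypothetical monthly index values for two earlier years.
monthly_by_year = {
    1938: [82, 80, 79, 81, 84, 86, 88, 90, 91, 93, 95, 97],
    1939: [96, 95, 95, 94, 96, 98, 99, 101, 108, 112, 113, 114],
}

condensed = annual_summary(monthly_by_year)
for year, figures in condensed.items():
    print(year, figures["low"], figures["high"], round(figures["average"], 1))
```

Each condensed year then occupies one narrow position on the contracted part of the time scale, while the current year keeps its full monthly or weekly detail.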
For the business man who has become accustomed to these forms there is no misrepresentation of facts. He has in compact form just the information he wants, and is well aware that the current year is a sort of "slow motion picture" in comparison with the previous scale. However, these procedures are not recommended for the student as methods for general interpretation. When they are used, the chart should be divided vertically into several segments separated by narrow white spaces to indicate the points at which the base scale has been changed. Figure 71, page 531, illustrates two changes in the time scale. The main interest is in the weekly data shown on a large scale for 1940 and 1941; the preceding four years of monthly data are shown on a smaller scale; and for the earlier years the bars show the annual range and the average for each year.

Logarithmic Scale. The logarithmic⁸ or ratio scale is widely used as a graphic device because it permits equal spaces to represent equal percentage changes at any point on the vertical scale of a time series. The space between 100 and 200 is the same as between 50 and 100, 20 and 40, 6 and 12, 1.5 and 3, .005 and .01, and so on.

Explanation of principle: Figure 45 illustrates the method by which the spacing on the logarithmic scale is determined. On the left (A) is an ordinary arithmetic scale, from 0 to 2; in the center (B) is a column of logarithms, whose characteristics range from 0 to 2, marked off at points measured according to the arithmetic scale (A); on the right (C) is a column of natural numbers which are the anti-logs of the logarithms opposite them on scale (B). These natural numbers in (C) are therefore spaced according to what is known as the logarithmic or ratio scale.

The advantage of using scale (C) is based on the rule for multiplication by means of logarithms: when two numbers are to be multiplied together, their logarithms can be added and the sum will be the logarithm of the product of the two numbers.
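The equal-spacing property and the multiplication rule just stated can be checked numerically. This is a small sketch using common (base 10) logarithms; the function names and the `cycle_height` parameter (the height given to one cycle on the paper) are assumptions made for the illustration.

```python
import math


def log_position(value, cycle_height=1.0):
    """Height of a value on a ratio scale, measured from the position of 1."""
    if value <= 0:
        raise ValueError("a ratio scale has no place for zero or negative values")
    return cycle_height * math.log10(value)


def spacing(low, high, cycle_height=1.0):
    """Vertical space between two values on the ratio scale."""
    return log_position(high, cycle_height) - log_position(low, cycle_height)


# The pairs named in the text: each upper value is double the lower one.
pairs = [(100, 200), (50, 100), (20, 40), (6, 12), (1.5, 3), (0.005, 0.01)]
spaces = [spacing(low, high) for low, high in pairs]

# Every space equals log 2, about .301 of a cycle.
print([round(s, 3) for s in spaces])

# Adding logarithms multiplies the numbers: log 6 + log 2 = log 12.
print(math.isclose(log_position(6) + log_position(2), log_position(12)))
```

Because only the ratio between two values fixes the space between them, the same doubling occupies the same distance wherever it falls on the scale.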
In Figure 45 the space marked a c on the center scale stands for the logarithm of 6; a b stands for the logarithm of 2. Hence if we wish to multiply 6 by 2 (that is, to double it or increase by 100 per cent) we can add a space c d, equal to a b, to the space a c, and we should arrive at a d, the logarithm of 12. This is precisely what does occur, as can be verified by measuring with a ruler. Likewise the space from log 4 to log 8 can be measured and proved equal to the same space a b, as is also log 75 to log 150, log 300 to log 600, etc. In other words, adding the space a b, representing log 2, to any other value at any point on the scale (C) will multiply that value by 2, or increase it by 100 per cent. Now take the space on scale (C) measured by log 3. Added to itself, we reach log 9; added to log 10 we reach log 30; etc. In each case the original value of the anti-log is multiplied by 3 or increased by 200 per cent.

⁸ See Appendix C for explanation of logarithms, rules for their use, and table of logarithms.

[FIGURE 45. CONSTRUCTION OF THE RATIO SCALE. A. Arithmetic scale used in spacing logarithms of scale B. B. Logarithms measured by scale A. C. Natural numbers, anti-logs of scale B.]

The space measured on scale (C) by 1 to 10, or 10 to 100, is called a cycle. Every point in a cycle is ten times the value of the corresponding point in the cycle just below it, or represents 900 per cent increase.

Percentages of decrease follow the logarithmic rule of division: the quotient is the anti-log of the difference between the logarithms of two numbers. Just as in the case of any percentage change computation, per cents of increase or decrease between two given points have different bases and must be read differently.
That is, log 50 plus log 2 = log 100, an increase of 100 per cent. But log 2 subtracted from log 100 = log 50; 50 is 1/2 of 100, a decrease of 50 per cent. Similarly log 3 subtracted from log 90 = log 30; 30 is 1/3 of 90, a decrease of 66 2/3 per cent. And log 50 subtracted from log 200 = log 4; 4 is 1/50 of 200, a decrease of 98 per cent.

In the portion of logarithmic scale illustrated in Figure 45, only two cycles are shown, from 1 to 10 and 10 to 100. This scale can be extended upward, of course, to 1,000, 10,000, etc., and it may also be extended downward indefinitely to .1, .01, .001, .0001, but never can reach zero.⁹ There is therefore no zero base on this scale, nor any other fixed point from which heights are measured. Hence only the portion of the scale that is used in plotting need be shown in the graph of a given series. Relative changes are measured by the distance between any two points on the vertical scale. Any alteration in the slope of a curve, therefore, indicates a changing relative rate of change¹⁰ in the data. A curve that follows a straight line upward is increasing at a constant relative rate. The possibilities of changing relative rates of increase in a curve, and corresponding relative rates of decrease, are illustrated in Figure 46.

a) If it is convex upward, it is increasing at an increasing relative rate.
b) If it is concave upward, it is increasing at a decreasing relative rate.
c) If it is concave downward, it is decreasing at an increasing relative rate.
d) If it is convex downward, it is decreasing at a decreasing relative rate.

⁹ Consequently the logarithmic scale cannot be used for a series that includes zero or negative values.

Because there is no fixed base line, scale equation between differing units presents no great problem. One unit can be shown in tens on one side of the scale, and another unit in thousands on the other side.
The scale values can be adjusted at will in order to bring the curves to the relative positions that afford the most effective comparison of their slopes at various periods, provided the original ratio relationship set up by the logarithmic scale is not tampered with.¹¹ This means that every value may be multiplied by the same number throughout, changing the cycles, for example, from 1-10-100 to 3-30-300, or 4-40-400. Each cycle is still ten times the value of the cycle below it, and the intervening spaces also keep their original ratio values. It is sometimes convenient to make this adjustment by multiplying in order to bring the curves closer together or to bring all the values within one less cycle. For example, the series 8, 20, 36, 47, 80, 200 would require the use of three cycles 1-10-100-1,000. But if the scale is multiplied by any factor from 2 to 8, the series will fall within two cycles, 2-20-200 or 8-80-800.

¹⁰ When we speak of rates of change on a logarithmic scale, those rates are expressed in per cents. They are really relative rates of change and the fact that they are expressed as per cents is usually taken to imply relative rates without employing the cumbersome terminology. However, we shall use "relative rates of change" in this text to keep the student constantly reminded that the rates of change are expressed in per cents.

¹¹ Note that if the original values are moved up on the scale, as moving 1 to 2, 2 to 3, etc., or if an equal amount is added to each original value, the true relationship of the logarithmic values will be entirely distorted.

[FIGURE 46. CURVES SHOWING CHANGING RELATIVE RATES ON A RATIO SCALE. Panels: rates of increase; rates of decrease.]

Example and interpretation: The advantage of the semi-logarithmic time series graph, that is, one in which the time scale is arithmetic and the vertical scale logarithmic, can be illustrated by a study of Figure 44-D.
After looking at the three preceding graphs of these data, drawn on three different arithmetic scales, one scarcely knows which fuel has increased in use the more rapidly. The rates of change could be computed from A or B but certainly they are not readily apparent on either graph. From Figure 44-C, which shows the index numbers, the percentages of change can be compared as related to a certain base period. This indicates that the use of oil has increased faster than that of coal, but if some earlier period had been taken as the base the relative increase in the use of oil would have been greatly exaggerated. With the logarithmic scale, however, the difference in the slopes of the two curves is apparent at a glance.

From the first to the second five-year average we know that oil-produced power increased at a greater rate than coal-produced power because its curve slants upward more sharply. For the second five-year interval the use of coal increased a little faster than that of oil, but during the next period the two curves are practically parallel, hence the two were increasing at an equal relative rate. From this time on until 1926-30 the use of oil increased more rapidly during every period except 1891-95 to 1896-1900. During the final period the consumption of both fuels declined, but coal more sharply than oil.

Coal-produced power increased at an almost constant relative rate from 1881-85 to 1906-10, then it leveled off slightly for two periods, and finally turned downward. Its subsequent course has been a decrease at an increasing relative rate except for a minor recovery in 1926-30.

Oil-produced power likewise increased at a constant relative rate from 1896-1900 to 1906-10; the line then leveled off, indicating a smaller relative rate of increase for the next two periods, after which it turned up and resumed its previous course for one more five-year period.
This periodic rise in the curve can be measured on the scale and found to be equivalent to the distance between 1.0 and 1.7, or a 70 per cent increase. Computation either from the original data, Figure 44-A, or from the index numbers, 44-C, will prove this to be approximately correct for the three periods, 1896-1900 to 1901-5, 1901-5 to 1906-10, and 1916-20 to 1921-25.

It scarcely needs to be added that since the slopes of the curves are so significant in this type of graph, neither of the scales can be tampered with in any way. Any omission of intervals or change in the time scale would entirely distort the slopes of the curves.

Methods of making a ratio scale: Semi-logarithmic graph paper can be purchased in any needed number of cycles and with almost any arrangement of the base scale in time intervals. However, in order to use the ratio scale confidently, the student should understand how it is made and should be able to make his own scale if necessary. It can most easily be marked off with a slide rule if one is available. If a table of logarithms is at hand the simplest procedure is not to draw a complete ratio scale, but to plot the logarithms of each point on an arithmetic scale, just as scale (B) was plotted according to scale (A) in Figure 45. The plotted points will be equivalent to the anti-logs (natural numbers) plotted on a ratio scale, just as scale (C) was equivalent to scale (B). If neither of these aids is to be had, correct proportions may be obtained by plotting a geometric series¹² at evenly spaced intervals on the vertical scale using any starting point (t1) and any common multiplier (r).

¹² In a geometric series the ratio of any term to the preceding term is constant. If t1 denotes the first term, n the number of terms and r the common ratio, the series may be written, t1, t1·r, t1·r^2, ..., t1·r^(n-1).

Thus if t1 = 1 and r = 2, a scale of 1, 2, 4, 8, 16, 32, 64, 128, 256, etc., could be used for plotting.
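The geometric-series construction, and the earlier rule for choosing how many cycles a series needs, can be sketched as follows. The function names are invented for the illustration; the series 8, 20, 36, 47, 80, 200 and the scale 1, 2, 4, ..., 256 are the ones given in the text.

```python
import math


def geometric_scale(first_term, ratio, count):
    """Scale values to be ruled at evenly spaced intervals: t1, t1*r, t1*r^2, ..."""
    return [first_term * ratio ** k for k in range(count)]


def cycles_needed(series, scale_start):
    """Number of tenfold cycles required when the scale begins at scale_start."""
    if scale_start <= 0 or scale_start > min(series):
        raise ValueError("the scale must begin above zero and not above the lowest value")
    cycles = 0
    top = scale_start
    while top < max(series):
        top *= 10
        cycles += 1
    return cycles


# The scale from the text: first term 1, common ratio 2.
scale = geometric_scale(1, 2, 9)

# Evenly spaced rulings are justified because the logarithms of successive
# terms differ by the same amount (here log 2).
steps = [math.log10(b) - math.log10(a) for a, b in zip(scale, scale[1:])]

# The series from the text needs three cycles from 1, but only two from 2 to 8.
series = [8, 20, 36, 47, 80, 200]
print(cycles_needed(series, 1), [cycles_needed(series, s) for s in range(2, 9)])
```

Any other starting point or multiplier gives an equally valid ratio scale; only the convenience of reading values between the rulings changes.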
Although the scale is accurate, the plotted curve may be somewhat approximate because it is hard to determine exact values on such a scale.

PLANNING GRAPHS FOR GENERAL EFFECT

After having selected the kind of graph he intends to use to illustrate his point, the statistician is ready to block out his actual plan for drawing. His degree of success at this point will be in direct proportion to his ability to combine artistic principles and technical skill with statistical acumen.

Artistic Considerations

Fortunately, no artistic genius is required in order to create an artistically effective graph. It is necessary only to understand the simplest rudiments that serve as guides to practically every form of artistic expression: size, proportion, balance, and contrast.

Size. The size will depend primarily on how the graph is to be used: is it to be published, or used for lecture purposes? A wall chart must be large and clear enough to be seen from any point in the room or auditorium. There is no use in preparing such a chart if it is too small. The lighting conditions under which it is to be shown must also be taken into account.

If the graph is to be printed the size of the page will determine its final dimensions, but the original may be drawn from l to 3 or 4 times larger. Less meticulous care will be needed to draw it on a scale larger than its final form since small imperfections will disappear in photographic reduction.

The amount of detail included is also a factor in determining the size of a graph. If only a single important relationship or a general condition is to be emphasized, one-half or even one-quarter of a page will suffice; whereas a more important or comprehensive graph will require a full page. If the graph includes a great deal of complex information, a variety of different kinds of lines, and detailed scales and legends, it may be necessary to use a folded insert even larger than the page of the book.
Students who use prepared graph paper of standard size should remember that it is not necessary always to make a graph that will cover the entire sheet. Each graph can be made of suitable size and proportion by inclosing a part of the page within a border.

Proportion. The exact relation between the length and width of a graph is determined to some extent by the data that are being presented, but in general there is a range within which a pleasing effect may be attained. If a graph is too long and narrow, either horizontally or vertically, it has an awkward, stretched-out appearance. Square graphs present a monotonous appearance and do not fit the page conveniently. The proportions will be within a pleasing range if the length is somewhere between 1J and Ij times the width.¹³ Probably the most convenient standard to use in preparing material for publication is known as "root two" or the ratio of 1.414 to 1. The long side is equal to the diagonal of a square drawn on the short side, and consequently if the rectangle is divided in half the resulting rectangles have the same proportions as the original, i.e., 1 to .707. A graph that is drawn with these proportions may occupy a whole page turned the long way, or reduced to half size will fit across half of the same space in normal position.

Balance. The term "balance" as applied to a graph has the same meaning as in any other kind of picture. It is a term borrowed from physics to indicate that there is an approximately equal stress on either side of a central point. The statistician is not at liberty to select his data so that, for example, the peaks and troughs of his curves will balance artistically. He must therefore depend upon his auxiliary material if necessary to offset the appearance of an unbalanced set of data.
To a certain degree he can enlarge one scale and reduce the other in order to alter the shape of his curve, although discretion must be used to avoid an exaggerated effect of fluctuation. If he has a set of bars that nearly fill the entire space, he will make them slender enough so that they will not bulk too large and heavy. If in spite of every effort a good fourth or more of the surface remains blank, he might print his title or key in that section, or insert a small table of the data. (See Figure 44-A and C.) Note, however, that it is never permissible to insert printed material between any significant portion of the graph and its accompanying scale. Instead of using a key, if legends are printed close to each curve, it is usually possible to distribute them in clear spaces on the graph instead of bunching them all at the top or at one side. The addition of a border is a great aid in tying all parts of a graph together into a well-balanced whole, and is particularly necessary for maps, circles, and simple bars. (See Figures 35 and 36, pages 299-305.) Many graphs will be found in print that have no borders except for the page margins. These graphs are usually of the two-dimensional type, such as Figures 42 and 43, in which the limits of the grid itself, bounded by the horizontal and vertical axes with the title of the graph printed above, practically take the place of a border in marking out the definite space occupied by the graph.

¹³ "Length" refers to the longer dimension which is most frequently the horizontal measurement; "width" is the shorter dimension, usually the height.

Contrast. Boldness is the secret of effective graphic presentation. Since the sharpest contrast can be achieved by the use of black and white, this combination is most commonly used in statistical graphs.
Other reasons for preferring black and white are: (a) Graphs in color are much more expensive to reproduce, and some colors cannot be used in ordinary photostating. (b) Colors cannot be arranged in varying degrees of intensity as unmistakably as can the standard types of black and white cross-hatching. (c) All readers do not evaluate colors in the same way, and some may even be color blind.

Appropriate types of cross-hatching for various purposes have already been discussed and illustrated in chapter XIII by the circle in Figure 35-B, the map in Figure 37, the bars in Figure 41, and in this chapter by the strata charts, Figure 43. To sum up the general rule again: when cross-hatching is used to represent quantitative information, increasing magnitudes must be indicated by increasingly intense or dark types; if it is used only to distinguish one set of data from another according to some non-quantitative characteristic, any kinds of cross-hatching may be chosen that will afford the greatest possible contrast, usually by alternating light and dark types. It is possible to buy gummed paper printed in a great variety of cross-hatched patterns. This may be applied to the graph and trimmed to the desired shapes with great saving of time.

Contrast may also be achieved by differentiation in types of lines when several curves are being presented. A number of possible types are shown in Figure 47. The most important data, such as a combined index, can most effectively be represented by a solid black line, and the lines for the other curves should be selected so that they are easily distinguishable from one another and so that all are equally distinct.

The curves representing the data should be heavier than the background lines on the graph. The usual order of these background lines is: border heaviest, followed by vertical and base scales and 100 per cent line, with other grid lines lightest.
FIGURE 47. TYPES OF LINES

When symbols are employed in a pictogram, boldness should be the primary consideration. One or two kinds of solid black symbols of standard shape, whose meaning cannot possibly be misunderstood, are much more effective than a variety of outline sketches that cannot be interpreted without reference to some printed material.

The printing on a graph should be heavy enough to contribute to the effect of contrast, but not so heavy as to detract from the diagram itself. Vertical capital letters and figures with no ornamentation whatever are most suitable for this purpose, and are most easily read. The heaviest lettering will be used for the title, a smaller size for the legends, scales, etc., and probably the smallest of all for the reference to the source or other notes of explanation. It should go without saying that neatness in lettering is one of the most essential features of an artistic graph.

Technical Details

Certain techniques in graphic construction have come to be accepted as standard. These details should be observed, not because they are fixed by an arbitrary set of rules but because they are all founded upon the principle that graphs must give a clear idea with the minimum of effort on the part of the reader.

Title. The title of a graph must meet the same requirements that were established for the title of a table.14 It should not give the conclusion to be drawn from the graph, as "Sales Larger This Year Than Last"; nor the method of analysis, as "Frequency Distribution of Number of Employees."15 Information such as the units of measure and the subgroups of a classification, which will be clearly indicated by the scales and legends or key, need not be included in the title of a graph.

14 Chapter VIII, pp. 160-61.
15 Whenever a title of this sort is used in this text, it is because the method being illustrated is of greater importance to the reader than the actual data.

Legend or Key.
These terms are often used interchangeably, but for this discussion "legend" will refer to labeling written within the bars, sectors, etc., or adjacent to the curves of time series to tell what each represents. (See the graphs in chapter XX.) "Key" will refer to a group of blocks or lines at the bottom of a graph indicating by sample pieces of the various lines or types of cross-hatching the significance of each wherever it may appear on the graph. (See the graphs in chapter XXII.)

There is no hard-and-fast rule as to which method should be employed. In general, it may be said that if the lines or hatching are repeated in several different parts of the graph it is better to use one key that will apply to all parts. In a map where the same kind of hatching or dots appears in a number of different areas, a key is practically unavoidable. If the graph includes two or more circles or sets of bars, each having corresponding parts that follow a common system of shading, it is easier to follow one key than to read the same legends in several different places. On the other hand, if there is plenty of space to print a clear legend right next to the curve or within the sector, good judgment will indicate that this should be done.

Whenever legends are printed on the graph there are a number of points to consider. (a) There must be no possible confusion as to which curve or other part the legend is intended to mark. (b) If possible, avoid printing between any line or bar and the scale from which its value must be read. (c) On a closely crossed grid a white space inclosed in a border should be left clear for printing each legend. (d) Legends should be clearly printed, and worded as briefly as possible.

The use of a key likewise calls for some words of caution. (a) The key must be neatly ruled and adequately labeled.
(b) The lines or hatching must correspond exactly to those used in the graph. (c) The key is a part of the graph and should be inclosed within the outer border, if the graph has a border; certainly it should never be transferred to some other page.

Scales. A scale has two parts: its general label and the markings of its subdivisions.

Labels: The label states the unit of the vertical scale or the numerical classification of the horizontal scale, as tons, dollars, years, etc. When the units are counted in large groups instead of singly, it also indicates the number in each group. To avoid confusion in locating the decimal point, units should be grouped in thousands, millions, or billions, rather than in tens, hundreds, ten thousands, etc. For example, in a scale having a range of 0 to 2,000 tons, the values might well be written in full, or they might be shown as .25, .50, 1.00, 1.25, etc., under the label "tons in thousands" or "tons, 000 omitted." Just "000" means nothing; one does not know whether to interpret it as "in hundreds" or "000 omitted." Sometimes the ciphers omitted are stated in the title, in which case they should not be repeated in the scale label. The full contraction must be indicated in either one place or the other, never divided between the two.

If a multiple scale is being used, each scale must state the item to which it applies. A graph of index numbers or other per cents will be labeled either "index" or "per cent," with no reference to the original unit.

The graphs shown in this text should be observed for standard practice in wording and arrangement. Note that the labels always read parallel to the base of the graph; that is, the label of the vertical scale appears across the top of that scale rather than vertically along the side, whereas for the horizontal scale it is in the center under the markings for years, months, etc.
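The grouping rule for scale labels can be illustrated with a short Python sketch. It is an illustration only and not part of the original text: the function name scale_label, its arguments, and the choice of the largest grouping that does not exceed the top of the scale are all assumptions made for the example.

```python
# Hypothetical helper: group units in thousands, millions, or billions
# (never in tens, hundreds, or ten thousands) and word the scale label.
GROUPS = [
    (1_000_000_000, "billions"),
    (1_000_000, "millions"),
    (1_000, "thousands"),
]

def scale_label(max_value, unit):
    """Return (divisor, label) for a scale running from 0 to max_value."""
    for divisor, name in GROUPS:
        if max_value >= divisor:
            return divisor, f"{unit} in {name}"
    return 1, unit  # small values are simply written in full

# The 0 to 2,000 ton scale mentioned in the text.
divisor, label = scale_label(2_000, "tons")
ticks = [0, 500, 1_000, 1_500, 2_000]
shown = [t / divisor for t in ticks]
print(label)   # tons in thousands
print(shown)   # [0.0, 0.5, 1.0, 1.5, 2.0]
```

Because the label carries the full contraction ("tons in thousands"), nothing further about omitted ciphers would need to appear in the title.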
Scale divisions and grid lines: Grid lines are scale divisions that are drawn all the way across a graph. They are usually fine solid lines of uniform thickness, although the lines indicating the ends of years, intervals of 50, etc., may be heavier than the intervening lines in order to set off the major divisions of a chart. Frequently only these main guide lines are drawn all the way across, the other values being indicated by short stubs along the axis. It is not necessary to indicate the numerical values of each one of these stubs but only enough of them to enable the reader to determine the value of any plotted point without too much trouble. The figures that are printed along a scale should be directly opposite the points to which they refer.

The methods of marking intervals in a time series require special attention in order to avoid confusion in reading the graph. First we shall consider the various ways of charting annual data. There are four alternatives, as shown in Figure 48.

In A, the year is indicated directly below the grid line on which the point is plotted. This is the preferable method for recording values at the same date, such as May 1, for several successive years; it can also be used for yearly averages or totals, or whenever a single figure represents the entire year. However, B is more suitable for the latter situation, since each point is plotted at the center of the space between two grid lines, the label for the year being also centered directly below it. Method B would also be correct for data as of June 30 or July 1, but not for any other given date, since the plotted points fall in the center of the yearly spaces. Method C is incorrect for yearly averages or totals but could be used if the data were as of December 31 or January 1, provided this fact were made clear in the title.
For any other data it is ambiguous because one cannot tell whether the year named in the center of the space applies to the point on the grid line preceding it or following it. The last method, D, is the reverse of C; it is equally ambiguous and would be correct under no circumstances.

The same principles can be applied to the correct graphing of monthly data. The space representing a year is usually set off by grid lines. This annual space must therefore be divided into 12 equal parts, each of which represents one month. The spaces indicating the months need not be labeled in a long time series, although if the scale is large enough the abbreviation or initial of the month at the center of each space is an aid to the reader. Years should be printed out in full, horizontally below the monthly labels, at the center of the year's space.

If the monthly data are totals, averages, or mid-month recordings, they should be plotted at the center of each month's space. There will then be no value plotted directly on the grid line that marks the year's end, and it will be perfectly clear which point stands for December and which for January. (See Figure 48-E.) However, end-of-month data should be plotted at the end of each month's space, so that December 31 will quite correctly fall on the end-of-the-year's mark. (See Figure 48-F.) As in the case of annual data, Figure 48-C, this method is ambiguous unless the title of the graph indicates that the plotted points are recordings as of the last day of each month.

It would also be possible to locate monthly stubs at the center of each monthly interval, Figure 48-G. This corresponds to method A for annual data, and causes no difficulty in reading the graph.
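The plotting positions for monthly data described above can be sketched in Python. This is a minimal illustration, not the authors' procedure: the function name month_position, the "mid" and "end" labels, and the convention of measuring the horizontal scale in years (so that the space for 1936 runs from 1936 to 1937) are assumptions made for the example.

```python
def month_position(year, month, timing="mid"):
    """Horizontal position of a monthly figure, measured in years.

    Each year's space runs from `year` to `year + 1` and is divided into
    12 equal parts. Use timing="mid" for totals, averages, or mid-month
    recordings, and timing="end" for end-of-month recordings.
    """
    if timing == "mid":
        return year + (month - 0.5) / 12
    if timing == "end":
        return year + month / 12
    raise ValueError("timing must be 'mid' or 'end'")

# December 1936 and January 1937 plotted both ways.
dec_mid = month_position(1936, 12, "mid")
jan_mid = month_position(1937, 1, "mid")
dec_end = month_position(1936, 12, "end")

print(dec_mid < 1937 < jan_mid)  # True: no point falls on the year-end grid line
print(dec_end == 1937)           # True: December 31 falls on the year-end grid line
```

The first check corresponds to centered plotting, where December and January sit on either side of the year-end line; the second corresponds to end-of-month plotting, where the December point lies on that line and the title must say so.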
However, since the years are labeled at the center of the year's space, it is more consistent to plot each month at the center of a space, as in Figure 48-E, rather than at a stub.

FIGURE 48. METHODS OF PLOTTING TIME PERIODS (annual data, panels A to D; monthly data, panels E to G)

Accompanying Table. Since no graph aims to record exact numerical values it is always desirable to provide an accompanying table for the benefit of the reader who wishes to verify or make further use of the information. The table should appear on the same page as the graph, or on the page facing it, and both should read in the same direction. Students seldom realize the unfavorable impression made on an instructor or any other reader who is forced to compare a graph that reads vertically with a table that reads horizontally, or vice versa. It has already been suggested in the discussion of balance that a brief table may be printed on the graph in some unoccupied space if it does not interfere with the graphic presentation.

Reference and Notes. The necessity for quoting the source and noting any discrepancies in the data was explained in discussing the requirements for statistical tables.16 Practically the same rules may be applied to graphs although, if the information has been given in an adjacent table, reference on the graph need only be made to that table as the source. The reference, either to the accompanying table or the original source, is usually printed in the lower right-hand corner of the graph.

Important Points in Actual Construction. No attempt will be made in this text to give a summary of the principles of mechanical drawing. A course in that subject is a great aid to anyone who wishes to draw graphs neatly and correctly. It is possible, however, to secure manuals on the subject, lettering guides, etc.
A study of the instructions that come with ruling and lettering pens should help the student in his first efforts to use India ink. With a few hours of practice anyone can learn to handle a ruling pen and lettering stencils without blotting. Accuracy in scale and angle measurement is not beyond the capacity of the average person. Even lettering by hand is only a matter of a little care and practice in copying from lettering guides.

In outline form, the order of steps in drawing a graph is as follows:

a) Check all data for accuracy in computation or in copying from source.
b) Plan the scales to conform to the correct size and proportion, within the range of the data.
c) Measure scales and draw axes and guide lines in pencil. (More pencil guide lines will be needed than finally appear on the graph.)
d) Plot the data.
e) Check plotting, reading from the points back to the data.
f) Plan spacing of lettering: titles, scales, labels, key, source, etc.
g) Ink in all lines, including borders, guide lines, etc., taking time to let each section dry before doing further work near it.
h) Ink in lettering, using stencils if possible.
i) Erase all pencil marks.

16 Chapter VIII, p. 162.

PROBLEMS

1. What is the order of arrangement of the bars in the bar charts appearing in chapter XIII?

2. The following is the production of anthracite coal in the United States at five-year intervals from 1900-40 (thousand short tons):

   YEAR    PRODUCTION        YEAR    PRODUCTION
   1900      57,468          1925      61,817
   1905      77,660          1930      69,385
   1910      84,485          1935      52,159
   1915      88,995          1940      50,052
   1920      89,598

   a) Present these data in a bar diagram.
   b) Why is this form superior to a line diagram for these data?
   c) How would you read from this diagram the 40-year history of the anthracite coal industry?

3. Find an applied use of the band chart in a published source. Describe the contents of the chart and state briefly the major relations portrayed.
4. a) Plot the following data on four separate charts, corresponding to the four methods shown in Figure 44, A, B, C and D, pages 328-29. Use 1935 as the base for the index number graph.
   b) Explain briefly what you think each graph shows.

   APPROXIMATE SALES, GROSS PROFIT AND NET PROFIT OF A SMALL MANUFACTURING CONCERN, 1932 TO 1938

   YEAR     SALES     GROSS PROFIT    NET PROFIT
   1932    $15,000       $1,000          $100
   1933     22,000        3,000           400
   1934     18,000        1,500           200
   1935     26,000        4,500           800
   1936     20,000        2,250           400
   1937     42,000        6,750         1,600
   1938     33,000        3,375           800

5. a) Draw a graph of the grapefruit production data given below.
   b) Study and interpret the facts shown by your graph.

   PRODUCTION OF GRAPEFRUIT IN THE UNITED STATES, 1919 TO 1939 *

                PRODUCTION (Million Boxes)
   YEAR    California   Florida    Texas    Total
   1919        .4          5.9                6.3
   1924        .4          8.8       .2       9.4
   1929       1            8        2        11
   1934       2           15        3        20
   1935       2           12        3        17
   1936       2           18       10        30
   1937       2           15       12        29
   1938       2           24       16        42
   1939       2           17       15        34

   * Agricultural Statistics, 1938; and Crops and Markets, December, 1939.

6. a) Why should every ordinary scale chart have a zero base line?
   b) Why in using two vertical scales on the same chart should the two scales bear some fixed relation to each other?
   c) Under what conditions is it justifiable to use colored inks in drawing charts?
   d) In a two-dimensional graph how do you determine which values to plot on the base scale?

7. a) Find one published graph that you consider is correctly and effectively drawn, and explain why you think so.
   b) Find one published graph that you think has certain features that are incorrect, and give reasons.
   c) Find one published graph that you consider ineffective, and suggest changes that might add to its effectiveness.

REFERENCES

ARKIN, HERBERT, and COLTON, RAYMOND R., Graphs, How To Make and Use Them. New York: Harper & Bros., 1936.
BRINTON, WILLARD C., Graphic Presentation. New York: Brinton Associates, 1939.
HASKELL, ALLAN C., Graphic Charts in Business. New York: Codex Book Company, 1922.
KNOEPPEL, CHARLES E., Graphic Production Control. New York: The Engineering Magazine Co., 1920.
LEHOCZKY, PAUL N., Alignment Charts, Their Construction and Use, Engineering Experiment Station Circular No. 34. Columbus, Ohio: The Ohio State University Studies, 1936.
MODLEY, RUDOLF, How To Use Pictorial Statistics. New York: Harper & Bros., 1937.
MUDGETT, BRUCE D., Statistical Tables and Graphs. Boston: Houghton Mifflin Co., 1930.
RIGGLEMAN, JOHN R., and FRISBEE, IRA N., Business Statistics. New York: McGraw-Hill Book Co., Inc., 1932, Appendix III.
RIGGLEMAN, JOHN R., Graphic Methods for Presenting Business Statistics. New York: McGraw-Hill Book Co., Inc., 1926.
SMITH, HERBERT G., Figuring with Graphs and Scales. Stanford University, California: Stanford University Press, 1938.
Time Series Charts, A Manual of Design and Construction. New York: The American Society of Mechanical Engineers, 1938.

CHAPTER XV

FREQUENCY DISTRIBUTIONS AND GRAPHS

FREQUENCY DISTRIBUTIONS

A frequency distribution is simply one of the methods of classification of data, and in form resembles any other statistical table. An example that has already been introduced in the text is Table 16-A, chapter VIII (wage rates of explosives workers). This particular form of classification has been reserved for special treatment because the idea of grouping large masses of data according to their quantitative characteristics is one of the most fundamental processes in statistics. In many phases of business operations it is an important first step toward more advanced analysis.

A frequency distribution is always a classification of data in which the items are combined in groups according to size. The "ordered classification" is the independent variable, and the numbers of items that appear in the several groups become the dependent variable.
For example, the dependent variable, number of firms, might be classified in groups according to annual dollars of sales, number of tons of product shipped weekly, number of employees, or hourly wage rates paid. On the other hand various dependent variables might be tabulated with any of these classifications (independent variables); e.g., with a wage classification the dependent variable might be either the numbers of employees receiving the various rates, the numbers of years in which the several wage rates were paid, or the numbers of states in which the rates were standard. The number of units, or items counted, in each group is called its frequency.

According to this method of grouping large numbers of detailed observations, each individual item or occurrence loses its identity and becomes one of a larger group that has a broader definition of quantity or value. For instance, in grouping wage data a single wage payment of $18.52 might become one of a group of 65 payments designated as "$18.00 to $18.99." It follows that the basic requirements for a satisfactory frequency distribution are: (1) the value of each individual item must be known at the outset, and (2) the values must be grouped in such a way that the summary table will accurately represent the individual items from which it is compiled.

First Steps in Analysis

The steps that are followed in making an analysis by means of a frequency distribution will be illustrated by rent data that were collected in Columbus, Ohio. This small but representative sample of 155 rent payments was secured as a by-product of a study of consumer habits in the patronage of dry-cleaning establishments.

Arraying the Data. The initial step was to list each rent payment as the reports came in from the interviewers. The result is shown in Table 56.
This random listing gives no clue whatever to any possible interpretation of the data. The question now arises, if they were ranged in order of size would any significant relationship appear? In order to answer this they were next arranged in an array, as shown in Table 57.

TABLE 56
RENTS PAID BY 155 FAMILIES IN A CONSUMER SURVEY IN COLUMBUS, OHIO
(Dollars Per Month)

   $50   $ 8   $60   $50   $75   $21
    25    80    75    50    75    25
    18     9    75    50    53    25
    75    15    95    60    55    15
    55    16    35    60    60    18
    53    35    35    25    60    80
    30    35    35    24    60    75
    50    50    35    35    35    51
    31    30    32    25    65    75
    24    28    22    25    60    51
    13    27    22    25    40    50
    15    25    20    30    40    45
    40    40    40    40    45    35
    65    20     9    12    35    35
    68    18    17    13    35    35
    70    16    18    15    25    30
    80    13    30    18    15    30
    80    85    30    30    16    30
    35    90    35    35    18    18
    35    80    35    30    21    20
    40    65    35    40    21    30
    45    35    85    40    30    35
    40    51    95    40    30    35
    40    51    80    40    35    35
    48    60    70    85    40    35
    50    50    35    35    35

TABLE 57
ARRAY OF RENTS PAID BY 155 FAMILIES IN A CONSUMER SURVEY IN COLUMBUS, OHIO
(Dollars Per Month)

   $ 8   $21   $30   $35   $48   $65
     9    21    30    35    50    65
     9    21    30    35    50    65
    12    22    30    35    50    68
    13    22    30    35    50    70
    13    24    31    35    50    70
    13    24    32    35    50    75
    15    25    35    35    50    75
    15    25    35    35    50    75
    15    25    35    40    50    75
    15    25    35    40    51    75
    15    25    35    40    51    75
    16    25    35    40    51    75
    16    25    35    40    51    80
    16    25    35    40    53    80
    17    25    35    40    53    80
    18    27    35    40    55    80
    18    28    35    40    55    80
    18    30    35    40    60    80
    18    30    35    40    60    85
    18    30    35    40    60    85
    18    30    35    40    60    85
    18    30    35    40    60    90
    20    30    35    45    60    95
    20    30    35    45    60    95
    20    30    35    45    60

There are various ways of putting data in an array, depending somewhat on the form in which they have been collected. If each item is on a separate card or sheet or schedule, these could first be sorted according to size and then listed. Or they might be tallied by assigning one line of a ruled sheet of paper to each possible value and then writing down each rent as it appears from the random assortment. The result would appear as in Figure 49.
If the rents or tally marks are evenly spaced the resulting rows take the place of a rough bar diagram in indicating the distribution of frequencies according to rental value. To get a true picture it is necessary to have a line for every unit rental value in the series, whether or not it has any frequencies.

An alternative method for analyzing either the sorted or tallied data would be to draw a simple bar diagram as shown in Figure 50. The range of values is clearly revealed by comparing the shortest bar at the top of the graph with the longest one at the bottom. The values at which there are concentrations and the number of similar items of various values can be seen by looking for the bars of equal length. This type of graph is seldom used for final presentation unless it portrays characteristics which are peculiar to the data and which cannot easily be shown by any other graphic form. It is a helpful graph in preliminary analysis, however, for it provides the basis for the examination which is necessary before the data can be grouped.

FIGURE 49. TALLY OF MONTHLY RENTS PAID BY 155 FAMILIES IN A CONSUMER SURVEY IN COLUMBUS, OHIO (Columns: Rent; Number of Families)

Preliminary Grouping of the Data. From a study either of Figure 49 or Figure 50 it can be seen at once that the whole range of data extends from a low value of $8 to a high of $95. The largest number of rents appears to be at $35 but there are certain other concentration points, notably at $25, $30, $40, $50, and $60.
FIGURE 50. ARRAY OF RENTS PAID BY 155 FAMILIES IN COLUMBUS, OHIO (Each bar represents one family. Horizontal scale: rents in dollars. Data from Table 57.)

At individual values: This suggests the next logical step, which is the initial grouping process, that is, to count all the items having the same value. In the frequency array shown in Table 58 all of the rents appear in order along with the number of times each individual rent occurs.

TABLE 58
FREQUENCY ARRAY OF MONTHLY RENTS PAID BY 155 FAMILIES IN A CONSUMER SURVEY IN COLUMBUS, OHIO

   RENTALS PAID   NUMBER     RENTALS PAID   NUMBER     RENTALS PAID   NUMBER
       $ 8           1           $25           9           $53           2
         9           2            27           1            55           2
        12           1            28           1            60           8
        13           3            30          13            65           3
        15           5            31           1            68           1
        16           3            32           1            70           2
        17           1            35          28            75           7
        18           7            40          14            80           6
        20           3            45           3            85           3
        21           3            48           1            90           1
        22           2            50           9            95           2
        24           2            51           4
                                                          Total        155

The characteristics of the data begin to stand out more clearly. We now know exactly how many rents of each amount were paid. The $35 rent occurs 28 times, having the highest frequency in the array, while the $30 and $40 amounts are almost tied for second place with 13 and 14 frequencies, respectively. The rents less than $35 are concentrated between $8 and $32, whereas those greater than $35 are spread over a range from $40 to $95.

In class intervals: However, there are still too many separate values listed for easy comprehension of the complete information regarding these rents. The entire situation can be readily grasped only after the 155 items have been grouped into a few classes. These classes must cover the entire range from $8 to $95 and must represent as far as possible the characteristics that have been observed from studying the individual items.
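The steps from random listing to array to frequency array can be reproduced with a few lines of Python. This is a sketch for illustration only, not a procedure given in the text: the variable names are hypothetical, and the rent figures are transcribed from Table 56, reading down its six columns in turn.

```python
from collections import Counter

# Rents from Table 56 (dollars per month), one list row per table column.
rents = [
    50, 25, 18, 75, 55, 53, 30, 50, 31, 24, 13, 15, 40,
    65, 68, 70, 80, 80, 35, 35, 40, 45, 40, 40, 48, 50,
    8, 80, 9, 15, 16, 35, 35, 50, 30, 28, 27, 25, 40,
    20, 18, 16, 13, 85, 90, 80, 65, 35, 51, 51, 60, 50,
    60, 75, 75, 95, 35, 35, 35, 35, 32, 22, 22, 20, 40,
    9, 17, 18, 30, 30, 35, 35, 35, 85, 95, 80, 70, 35,
    50, 50, 50, 60, 60, 25, 24, 35, 25, 25, 25, 30, 40,
    12, 13, 15, 18, 30, 35, 30, 40, 40, 40, 40, 85, 35,
    75, 75, 53, 55, 60, 60, 60, 35, 65, 60, 40, 40, 45,
    35, 35, 25, 15, 16, 18, 21, 21, 30, 30, 35, 40, 35,
    21, 25, 25, 15, 18, 80, 75, 51, 75, 51, 50, 45, 35,
    35, 35, 30, 30, 30, 18, 20, 30, 35, 35, 35, 35,
]

array = sorted(rents)        # the array: rents in order of size
frequency = Counter(rents)   # the frequency array: rent -> number of families

print(len(rents))                # 155
print(array[0], array[-1])       # 8 95
print(frequency.most_common(1))  # [(35, 28)]

# Print the frequency array in order of rent.
for rent in sorted(frequency):
    print(f"${rent:>3}  {frequency[rent]:>3}")
```

The counts printed by the loop should agree with Table 58; for instance, $30 appears 13 times and $40 appears 14 times.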
Before continuing this process with the rent data it will be necessary to consider a number of points that must always be taken into account in determining the groups of a frequency distribution.

Principles for Grouping Data

The questions that must be answered before deciding how to group any individual data are:

1. Into how many groups, or class intervals, should a given set of data be divided?
2. What should be the width of each interval?
3. At what values should the class limits be set?
4. How should the class limits be designated?

Number of Intervals. The number of intervals is less important than the width of intervals and the values of class limits. The exact number used will finally be determined by the range of the data after these other two points have been decided. There are a few rule-of-thumb guides, however, which aid in roughly determining the number of intervals. In the first place, since the purpose of grouping is to aid in the summary and comprehension of data, there should be no more intervals than can be quickly grasped. In the second place, the number of intervals cannot be so small that important characteristics of the data are concealed. These two criteria, however, are too general to serve as operating rules in determining how many class intervals to use in a given frequency distribution.

Statisticians have indicated the number of intervals which in general meet the requirements of distributions of most kinds of data. Yule says that desirable conditions will usually be fulfilled if the "number of classes lies between 15 and 25."1 A minimum is suggested by the statement that it is "desirable to have more than eight classes."2 It has been suggested that the number of classes can be determined by the use of a formula which has been developed from the theory of binomial expansion. The formula as developed by Sturges3 is:

   Number of class intervals = 1 + 3.322 log of number of observations in the distribution.
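The formula can be evaluated with a short Python sketch. Two points here are assumptions rather than statements from the text: the logarithm is taken to the base 10 (the constant 3.322 is the reciprocal of the common logarithm of 2), and the result is rounded to the nearest whole number. The function name sturges_classes is hypothetical.

```python
import math

def sturges_classes(n):
    """Class intervals suggested by 1 + 3.322 log10(n), rounded to a whole number."""
    return round(1 + 3.322 * math.log10(n))

for n in (100, 155, 800):
    print(n, sturges_classes(n))
# 100 8
# 155 8
# 800 11
```

For the 155 rent payments the unrounded value is about 8.3, so the formula would point to roughly eight classes, subject to the cautions about studying the data first.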
Solution of this formula indicates that the following numbers of class intervals should be used with designated numbers of observations:

   NUMBER OF        NUMBER OF
   OBSERVATIONS     CLASS INTERVALS
       100                8
       200                9
       400               10
       600               10
       800               11

The number of classes should be determined only after making a careful study of all the characteristics of the data, instead of applying this formula indiscriminately.

1 G. Udny Yule and M. G. Kendall, An Introduction to the Theory of Statistics (London: Charles Griffin and Co., Ltd., 1937), p. 85.
2 Frederick E. Croxton and Dudley J. Cowden, Practical Business Statistics (New York: Prentice-Hall, Inc., 1937), p. 153.
3 H. A. Sturges, "The Choice of a Class Interval," Journal of the American Statistical Association, Vol. XXI (1926), pp. 65-66. Cf. Harold T. Davis and W. F. C. Nelson, Elements of Statistics (Bloomington, Indiana: The Principia Press, 1935), p. 16.

Width of Intervals. There are no arbitrary criteria for determining the width of the class intervals in any distribution, but the following considerations are pertinent to the problem.

1. Class intervals should not be so wide that too much of the detail of the distribution is lost through grouping. It is true that the purpose of the frequency distribution is to summarize and to reduce the volume of the data to workable proportions, but the features of the data should not be concealed or eliminated through grouping in wide intervals.

2. Little is gained on the other hand by arranging data in very small intervals, if the number of classes then remains too large to provide an effective summary.

3. The total number of frequencies in the distribution serves as a rough guide to the size of the class intervals4 that should be employed. That is, when there are a great many observations the intervals can be relatively small because it will be permissible to have a large number of classes.
Within the same range of data, if there are only a few observations the number of class intervals must be smaller and the width of the intervals will be correspondingly greater.

4. If there is any discernible pattern in the distribution, however, it will serve as a much better guide to the size of the classes. For instance, if hourly wage rates are being studied, it may be found that more men are paid even five- or ten-cent rates than any intervening amounts. This pattern must be preserved in the grouped data through correct choice of the size of class intervals. The class width should be five cents or some multiple of five cents so that there will be an equal number of concentration points within each interval.

5. The class intervals should be chosen so that there will be a minimum number of classes that contain no frequencies.

6. If the distribution which is being constructed is to be compared with others that are already prepared, the intervals in the new distribution should be made to conform with those in the previous distribution. If several different but comparable distributions are being prepared at the same time, the size of the class intervals must be established in view of the characteristics of the several distributions.

4 Related to the formula on page 356, Sturges recommends the following formula for the determination of the size of class intervals: Size of class intervals = (range of data) / (1 + 3.322 log of number of observations).

7. In a given distribution every effort should be exerted to make all class intervals equal. The distribution of data in unequal intervals makes analysis difficult and certain kinds of computation impossible. Although unequal class intervals should not ordinarily be used, there are certain cases in which they are unavoidable.

a)
In some cases where a few high-valued observations are widely dispersed, they may be grouped in increasingly large class intervals, so as not to reveal the identity of the individual cases. Unequal class intervals are frequently employed for this reason by various government departments. Table 59-A indicates a frequency distribution of this type.

b) Analysis of the data may indicate that unequal class intervals define the homogeneity of the observations more accurately than equal intervals. In Table 59-B, for instance, unequal class intervals were used because the management felt that these divisions of purchases gave them the most assistance in their merchandising plans.

c) Equal relative increases in the widths of class intervals may be of more significance to a particular distribution than equal absolute changes. Consequently, a frequency distribution which appears to have unequal class intervals may in reality have intervals which are increasing in size at a uniform rate.

TABLE 59

A. NUMBER OF BROADCASTING STATIONS IN THE UNITED STATES, BY ANNUAL REVENUE RECEIVED, 1935 *

   ANNUAL REVENUE           NUMBER OF STATIONS
   Less than $10,000                48
   $ 10,000- 24,999                 67
     25,000- 49,999                 59
     50,000- 99,999                 46
    100,000-249,999                 45
    250,000-499,999                 17
    500,000 and over                 7
   Total                           289

   * Radio Broadcasting, Census of Business: 1935, p. 53.

B. NUMBER OF PURCHASERS AT THE COLUMBUS CONSUMERS' COOPERATIVE ASSOCIATION, BY VALUE OF PURCHASES, JULY 1, 1937, TO DECEMBER 31, 1937 †

   VALUE OF PURCHASES       NUMBER OF PURCHASERS
   $  0.00 to $ 19.99              248
     20.00 to   39.99              140
     40.00 to   89.99              202
     90.00 to  149.99               74
    150.00 to  299.99               49
    300.00 and over                 11
   Total                           724

   † Unpublished Study of Patron Purchasing of the Columbus Consumers' Cooperative Association, 1938.

8.
Finally, the problem of determining the size of the class intervals in a frequency distribution cannot be separated from that of establishing the location of the class limits, that is, the values at which each group in the distribution will begin and end. It is usually necessary to consider these two points together before arriving at any decision with regard to either of them.

Limits of Intervals. As in the determination of the size of class intervals, there is no single guide to the location of class interval limits. The several general criteria which follow indicate the nature of the problems that arise and the solutions that may be employed in various types of distributions.

1. If there is no pattern nor other guide, the very simple procedure of dividing the range of the data in the distribution by the approximate number of intervals may be used. For convenience, the actual dividing points thus established would then be rounded to the nearest whole numbers. Such a procedure, however, may completely disregard important characteristics of the data that should be revealed by the distribution, and should never be employed unless a careful study of the data has failed to reveal any pattern.

2. If a pattern is discovered, the limits should be set so as to preserve in each group the characteristics of the individual items, the same as in determining the width of the intervals. This can be done by observing the values at which the frequencies are greatest and establishing the class limits so that these values fall at the midpoints. For example, in the case quoted of concentration of wage rates at five-cent intervals, the limits would not be set at 25, 30, 35 cents, etc., but at 27.5, 32.5, 37.5, etc., so that the concentration point falls at the center of each interval.
If ten-cent intervals were used the limits would be set at 27.5, 37.5, etc., or 22.5, 32.5, etc., so that there would be two concentration points in each interval, each equidistant from the center and from either end.

3. Even though no special pattern is present, the class limits should be established so that the value half way between the class limits approximates the arithmetic average of the observations included in each class interval. This midvalue of each interval is called the "midpoint" or the "class mark."

4. When possible, the limits should be chosen so that the midpoints are integers. The importance of this guide will become clear in the computation of averages from frequency distributions. As in the case just cited, it is usually more important to have the midpoint an integer than to have the class limits themselves integers.

5. On some occasions one or both ends of the distribution may be left "open"; the minimum and maximum values are not shown. (See Table 59.) These open-end frequency distributions are sometimes necessary to conceal the identity of the cases at the extremes, but the absence of limiting values is a serious handicap in subsequent analysis.

Designation of Class Limits. The interpretation of the data in a frequency distribution and the evaluation of their accuracy depend largely upon the precise designation of the class limits. The method of designation may in turn depend upon the nature of the data involved.

Discrete or continuous data: An important consideration is whether the data are discrete or continuous. Discrete data are those which occur only at exact values at regular intervals but never at any intervening values. For example, stock prices are quoted in eighths of a point. With the exception of a few special listings, no stock would be quoted at any price between two adjacent eighths, such as between 1/8 and 1/4. Likewise a classification according to number of employees could never be anything except whole numbers.
In the latter case the classes would naturally be designated as 1 to 10, 11 to 20, etc., and no question could arise as to any fractional value between 10 and 11. Continuous data, on the other hand, are those which may occur at every conceivable point along a continuous scale of values. This distinction between measured values and separate items and the methods for handling each statistically will be discussed in greater detail in chapter XVII. As a matter of fact, a classification in discrete units is much more puzzling to handle correctly in computation but, when class limits are being designated, continuous data afford a greater variety of alternative methods.

Examples of methods: Some of the methods for designating class limits are better than others in clarifying the actual limits that are employed. One of the most common methods has been shown in Table 59-A. Other methods are illustrated in Figure 51. Of the four methods, the one shown in Figure 51-A is the poorest, for it may be ambiguous. It is not clear whether exactly $500 is included in the first or the second interval. In spite of this weakness, this form is widely used and is ordinarily interpreted as $250 and under $500.

FIGURE 51
METHODS OF DESIGNATING CLASS LIMITS

A                   B                          C
$  250-$  500       Under 30 cents             $ 51-$100
   500-   750       30 and under 35 cents       101- 150
   750- 1,000       35 and under 40 cents       151- 200
 1,000- 1,250       40 and under 45 cents       201- 250
 1,250- 1,500       45 and under 50 cents       251- 300
 1,500- 1,750       50 and under 55 cents       301- 350
 1,750- 2,000       55 and under 60 cents       351- 400

Figure 51-C differs from the form in Table 59-A only in the location of the class limits: in one the round hundreds and fifties are the upper values of the classes, whereas in the other various multiples of round thousands are the lower values of the classes. Either of these forms can be used, Figure 51-C being preferred for discrete data, and Table 59-A for continuous data.
The purpose in each case is to make the class intervals equal. Discrete data, which usually start at a value of one, must read 1-50, 51-100, etc., whereas continuous data are measured from zero and read 0-99, 100-199, etc.

In the case of continuous data, the person who prepares the table must make a decision regarding significant figures, whether he uses the form in Table 59-A, 59-B, or even Figure 51-B. In the first case the values are rounded at dollars, so that presumably any value up to $9,999.50 would go in the first class, and $9,999.50 and over in the second, etc. Similarly in Table 59-B the dividing line is $19.995. The designation in Figure 51-B indicates an indefinite number of decimal places, although in actual practice the dividing point between classes would seldom be carried any farther than a half cent.

Another common method of indicating class values, especially when the intervals are quite small, is by the value of the midpoint, as average grade, 75, 80, 85, 90, where 80 per cent includes everything from 77.5 to less than 82.5, etc. In other cases, classes that are listed in this way may represent single unit values of a discrete series.

Of all these possible methods for designating class limits, Figure 51-B is the least ambiguous. The units in which any particular data are expressed will ordinarily make clear to the reader to how many significant figures the class limits have been carried. This method requires more space for the stub of the table but it is nevertheless the method preferred by the authors.

An Example of the Preparation of a Frequency Table

The principles set forth in the preceding section will now be applied in the preparation of a frequency distribution of the sample of rents paid in Columbus, Ohio. The preliminary preparation of these data was carried out earlier in the chapter, leaving them in the form found in Table 58.
The next step is to determine the number of class intervals, the width of interval and the interval limits that will produce a concise and effective table. This means a table that is compact enough to be grasped quickly, and with the details arranged so that no essential characteristic of the data will be lost.

The range of $87 between the lowest and highest rents paid immediately suggests the use of nine $10 intervals. Nine intervals for 155 items appears reasonable and the $10 width is convenient. The most important consideration, however, is the existence in the distribution of concentration points at the $5 and $10 rents, i.e., the tendency to fix rents at $25, $30, $35, $40, etc. This means that the width of the class interval must be $5 or some multiple of $5. A $5 interval would give too many classes, a $15 interval too few, hence $10 emerges as the proper width to use and the number of intervals is automatically set at nine, if the first class is set at $7.50 to $17.50. The first class might also read $2.50-$12.50, so that the last of ten classes would read $92.50-$102.50. There is no general rule that requires the use of either one or the other of these systems of intervals and the distribution of frequencies will, of course, be different according to which is employed. Perhaps the best plan is to regroup the initial $5 intervals in $10 intervals according to both systems and then select the one that gives the smoother distribution or appears to be the better description of the data. In the distribution of rents the $7.50-$17.50 set of intervals seems preferable.

The same circumstance that led to the selection of $10 intervals also becomes the guide to the proper class limits. Each class will contain two points of rent concentration.
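The choice of a $10 width for the rent sample can be cross-checked against the Sturges formula quoted in the footnote earlier in the chapter. The sketch below is only an illustration: it assumes the logarithm in that formula is a common (base 10) logarithm, and the function names, including the helper that rounds to a multiple of the $5 concentration step, are hypothetical and not part of the text.

```python
import math

def sturges_class_count(n_observations):
    """Approximate number of classes: 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n_observations)

def sturges_width(data_range, n_observations):
    """Class width from the footnoted formula: range / (1 + 3.322 * log10(n))."""
    return data_range / sturges_class_count(n_observations)

def nearest_multiple(value, step):
    """Round a width to the nearest multiple of the concentration step
    (hypothetical helper; never returns less than one step)."""
    return step * max(1, round(value / step))

rent_range = 87   # dollars between the lowest and highest rents, as stated in the text
n_rents = 155     # number of rents in the sample

classes = sturges_class_count(n_rents)
width = sturges_width(rent_range, n_rents)
practical_width = nearest_multiple(width, 5)

print(f"classes suggested by the formula: {classes:.1f}")
print(f"width suggested by the formula:   ${width:.2f}")
print(f"nearest multiple of $5:           ${practical_width}")
```

Under these assumptions the formula suggests a little over eight classes and a width of roughly ten and a half dollars, which is in line with the nine $10 intervals chosen above. The formula is only a rough guide; the pattern of concentration points remains the deciding consideration.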
These must fall at equal distances from the center and ends of the class in order to meet the requirement that the average value of the items included in any class shall be approximately equal to the midpoint of that class. Hence the first class containing the $10 and $15 concentration points must have its midpoint at $12.50. The next containing the $20 and $25 concentration points must have its midpoint at $22.50 and so on. The class limits, therefore, must be $7.50, $17.50, $27.50, ..., $97.50.

Actual Frequencies. The three parts of Table 60 contain different distributions resulting from the use of three distinct sets of $10 class intervals. That is, A, B, and C are independent distributions each of which has been constructed from Table 58.

TABLE 60
THREE FREQUENCY DISTRIBUTIONS OF MONTHLY RENTS PAID BY 155 FAMILIES IN A CONSUMER SURVEY IN COLUMBUS, OHIO

CLASS INTERVAL              (1) FREQUENCY   (2) CLASS MARK   (3) CLASS INTERVAL AVERAGE

Frequency Distribution A
$ 5 and under $ 15                 7            $ 10              $11.0
 15 and under   25                26              20               18.5
 25 and under   35                26              30               28.2
 35 and under   45                42              40               36.7
 45 and under   55                19              50               49.6
 55 and under   65                10              60               59.0
 65 and under   75                 6              70               67.2
 75 and under   85                13              80               77.3
 85 and under   95                 4              90               86.2
 95 and under  105                 2             100               95.0
Total                            155

Frequency Distribution B
$ 0 and under $ 10                 3            $  5              $ 8.7
 10 and under   20                20              15               15.8
 20 and under   30                21              25               23.6
 30 and under   40                43              35               33.3
 40 and under   50                18              45               41.3
 50 and under   60                17              55               51.2
 60 and under   70                12              65               61.9
 70 and under   80                 9              75               73.9
 80 and under   90                 9              85               81.7
 90 and under  100                 3              95               93.3
Total                            155

Frequency Distribution C
$ 7.50 and under $17.50           16            $12.50            $13.6
 17.50 and under  27.50           27             22.50             22.0
 27.50 and under  37.50           44             32.50             33.2
 37.50 and under  47.50           17             42.50             40.9
 47.50 and under  57.50           18             52.50             51.0
 57.50 and under  67.50           11             62.50             61.4
 67.50 and under  77.50           10             72.50             73.3
 77.50 and under  87.50            9             82.50             81.7
 87.50 and under  97.50            3             92.50             93.3
Total                            155
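The requirement that each class mark approximate the average of the items in its class can be checked numerically from the figures in Table 60. The sketch below is an illustration only: the rows are transcribed from the table, while the function names and the frequency-weighted summary of the gaps are assumptions introduced here and are not a measure used in the text.

```python
# Rows are (lower class limit, frequency, class average), transcribed from Table 60.
# All classes are $10 wide, so the class mark is the lower limit plus $5.
DIST_A = [(5, 7, 11.0), (15, 26, 18.5), (25, 26, 28.2), (35, 42, 36.7), (45, 19, 49.6),
          (55, 10, 59.0), (65, 6, 67.2), (75, 13, 77.3), (85, 4, 86.2), (95, 2, 95.0)]
DIST_B = [(0, 3, 8.7), (10, 20, 15.8), (20, 21, 23.6), (30, 43, 33.3), (40, 18, 41.3),
          (50, 17, 51.2), (60, 12, 61.9), (70, 9, 73.9), (80, 9, 81.7), (90, 3, 93.3)]
DIST_C = [(7.5, 16, 13.6), (17.5, 27, 22.0), (27.5, 44, 33.2), (37.5, 17, 40.9),
          (47.5, 18, 51.0), (57.5, 11, 61.4), (67.5, 10, 73.3), (77.5, 9, 81.7),
          (87.5, 3, 93.3)]

WIDTH = 10.0

def gaps(rows, width=WIDTH):
    """Class average minus class mark for every class."""
    return [avg - (lower + width / 2) for lower, _freq, avg in rows]

def count_above_below(rows):
    """How many class averages lie above and below their class marks."""
    g = gaps(rows)
    return sum(1 for x in g if x > 0), sum(1 for x in g if x < 0)

def weighted_mean_gap(rows, width=WIDTH):
    """Frequency-weighted mean of (average - class mark); an illustrative summary only."""
    total = sum(freq for _lower, freq, _avg in rows)
    return sum(freq * (avg - (lower + width / 2)) for lower, freq, avg in rows) / total

for name, rows in (("A", DIST_A), ("B", DIST_B), ("C", DIST_C)):
    above, below = count_above_below(rows)
    print(f"Distribution {name}: {above} above, {below} below, "
          f"weighted mean gap {weighted_mean_gap(rows):+.2f}")
```

On these figures the averages in distributions A and B fall below their class marks in nearly every class, whereas in distribution C the gaps are small and of mixed sign.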
Column 1 records the number of rents falling within the limits indicated for the several classes in each of the three distributions. In distributions A and B the two concentration points fall at the beginning and center of the intervals. In distribution C, however, the center of the interval or class mark lies midway between the two concentration points.

Columns 2 and 3 have been added to Table 60 to demonstrate the superiority of distribution C. The class marks in column 2 conform to the definition previously given. The class averages in column 3 are obtained from the array in Table 57. For example, the seven rents recorded between $5 and $15 total $77, or an average of $11. Parallel computations for each class of each distribution lead to the averages as recorded. Comparison of column 3 with column 2 shows that in distribution A the averages are less than the class marks in all classes except the first and that the differences are appreciable except in the "$45 and under $55" class. Likewise, in distribution B the averages are below the class marks except in the first and second classes. In distribution C four averages are above and five below their respective class marks and the differences between the averages and the class marks are small. Distribution C, therefore, meets the requirement that class marks should approximate the actual averages of the items included, whereas the other two distributions contain a definite bias. This bias would have an adverse effect upon any numerical measures computed from those distributions. If the size of the rent sample were increased to several thousand items, the averages and class marks in distribution C would tend to coincide, but the bias would persist in distributions A and B. For this reason the characteristics of the universe "rents in Columbus, Ohio" can be studied from distribution C only.

Percentage Frequencies.
Percentage frequencies are preferable to actual frequencies for some purposes. Table 61 (which has been set up with title and headings such as would be used in a presentation table, in contrast to the work-table headings of Table 60) shows both the actual frequencies from Table 60-C, and their percentage distribution. Two major uses of percentage frequencies should be mentioned: (1) the comparison of the individual frequencies with each other and with the total, and (2) comparisons between two or more distributions having the same or equivalent class intervals. Thus from Table 61, column 2, it is apparent that more than one-fourth of the rents were between $27.50 and $37.50, that less than one-fourth were above $57.50 and that more than one-fourth were less than $27.50. The advantages of the percentage frequencies in comparing two distributions graphically will be shown later in the chapter.

TABLE 61
NUMBER AND PERCENTAGE DISTRIBUTION OF FAMILIES IN COLUMBUS, OHIO, ACCORDING TO VALUE OF MONTHLY RENTALS PAID

                                        FAMILIES
RENTALS PAID                  (1) Number   (2) Percentage Distribution
$ 7.50 and under $17.50           16              10.3
 17.50 and under  27.50           27              17.4
 27.50 and under  37.50           44              28.4
 37.50 and under  47.50           17              11.0
 47.50 and under  57.50           18              11.6
 57.50 and under  67.50           11               7.1
 67.50 and under  77.50           10               6.4
 77.50 and under  87.50            9               5.8
 87.50 and under  97.50            3               1.9
Total                            155             100.0

GRAPHS OF FREQUENCY DISTRIBUTIONS

The frequency distribution is a separation of a whole into parts, the frequencies being merely a record of the number of individual items falling in each quantitative class of the distribution. The major purpose in presenting a distribution graphically is to emphasize the relation of the parts to the total and to each other. The graphic methods used in presenting frequency distributions are perhaps more standardized than in the case of any other kind of statistical data.
The forms have become so widely accepted that it is necessary to follow without noticeable deviation the generally accepted rules for their construction. This does not imply, however, that we may not inquire into the underlying principles that have led to the development and universal acceptance of these methods.

Construction and General Characteristics

As was indicated at the end of chapter XIII, frequency distribution graphs are of the two-dimensional variety. The class intervals are always plotted on the horizontal axis and the frequencies on the vertical axis. Ordinary arithmetic scales are used on both, except for some very specialized types of distribution which are excluded from the present discussion. The vertical scale must always begin at zero, but the horizontal scale need include only the range of the class values, plus an extra interval at either end. The two most common frequency diagrams are the histogram and the frequency polygon.⁵

⁵ A third diagram, the smooth curve, is often discussed along with the histogram and the frequency polygon. It is a trace of the form which the frequency distribution would take if a very large number of cases were included and class intervals become infinitesimally small. From the point of view of universe and sample this concept is of considerable importance and will be employed in chapter XXVIII. The smooth curve has little application at the elementary level; hence no further reference will be made to it in this chapter.

The Histogram. This form of diagram consists of contiguous rectangles, or columns, ranged along the base scale, the height of each one being determined by the number of frequencies in the class upon which it stands. The total combined area of all the columns represents the total number of frequencies in the distribution. It may be considered that each column is like a pile of coins, each coin representing a single frequency.
The thickness of the coin equals the value of one frequency on the vertical scale, and its diameter corresponds to the width of the class interval. Viewed from the front each coin occupies a narrow rectangular space. Several such adjacent piles of different heights would look very much like a frequency histogram. If a few coins were moved from one pile to another the total front view area, representing the total number of coins or frequencies, would remain the same regardless of changes in the distribution.

Figure 52-A, which represents the rent data distribution of Table 60-C, illustrates the important features of all histograms. The greatest concentration of frequencies is at once apparent from the location of the tallest column on the base $27.50 to $37.50. The other columns start from zero frequency on the left and gradually increase in height as they approach this class of maximum frequency, while those on the right fall away from it and finally reach zero again. This is the characteristic shape of a frequency graph portraying the chance occurrence of a set of homogeneous events. Variations from this usual shape will be discussed later in the chapter.

Another point to be noted from the histogram is that each column rests not upon a single point but upon the entire interval included within the class limits. This indicates that the frequencies in any interval are spread over that interval, and that the base scale values occur in a continuous sequence.

The Frequency Polygon. This form of diagram is illustrated in Figure 52-B, using the same data as in 52-A. The histogram has been lightly blocked in as a background to show that the polygon can be drawn by connecting the midpoints of the successive columns. It could equally well be drawn without the histogram, by plotting points measured from the abscissas at the midpoints of the class intervals and the ordinates of the corresponding frequencies.
The line connecting these points is extended to zero at the midpoint of the class at either end beyond the range of frequencies; thus the broken line or "curve," together with the base line, forms a "polygon" inclosing the entire area of frequencies.

FIGURE 52. TWO TYPES OF FREQUENCY DIAGRAM OF RENT DATA. [A. Histogram; B. Frequency polygon. Vertical axis: number of rentals (frequencies), 0 to 40; horizontal axis: dollars of rent paid (class intervals), $7.50 to $107.50.] Data from Table 60-C.

It can be demonstrated that the total area of the polygon is exactly equal to that of the histogram, although the area included in each class has been slightly altered. When the midpoint of the rectangle standing on the base, $17.50 to $27.50, is joined with the midpoint of the rectangle standing on the base, $27.50 to $37.50, the triangle CDH is added to the area on the $17.50 to $27.50 base and the triangle ABH is removed from the area on the base $27.50 to $37.50. But the areas of the triangles are equivalent (AB = CD and angle a = angle b), therefore the area of the figure $X_2CBX_4$ is equal to the area of the figure $X_2CDABX_4$. A similar argument for each adjacent pair of rectangles proves that the area between the polygon and the base line is the same as the sum of the areas of the rectangles. Hence the polygon is a smooth trace of the histogram in which the total area is preserved but the idea of graduated increase and decrease in the frequencies is substituted for the steps of the histogram.

It must be noted, however, that (unless by chance points G-C-B lie in a straight line) triangle CDH which has been added to the class $17.50 to $27.50 is not equivalent to triangle CEK which has been removed from it, and that area $X_1KCHX_3$ is therefore not equal to area $X_1EDX_3$.
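The area argument can be verified numerically with the frequencies of Table 60-C. The sketch below is an illustration under stated assumptions: the polygon is treated as straight segments joining the class midpoints, extended to zero at the midpoints of the empty classes on either side, and the function names are hypothetical.

```python
FREQS = [16, 27, 44, 17, 18, 11, 10, 9, 3]   # Table 60-C frequencies
WIDTH = 10.0                                  # dollars per class interval

def histogram_area(freqs, width):
    """Sum of the rectangle areas (frequency times class width)."""
    return width * sum(freqs)

def polygon_area(freqs, width):
    """Area between the polygon and the base line, as a sum of trapezoids
    between successive class midpoints (zero height beyond both ends)."""
    heights = [0] + list(freqs) + [0]
    return sum(width * (a + b) / 2 for a, b in zip(heights, heights[1:]))

def polygon_area_within_class(freqs, i, width):
    """Area under the polygon over the base of class i only."""
    heights = [0] + list(freqs) + [0]
    left, mid, right = heights[i], heights[i + 1], heights[i + 2]
    at_lower_limit = (left + mid) / 2     # polygon height at the lower class limit
    at_upper_limit = (mid + right) / 2    # polygon height at the upper class limit
    half = width / 2
    return half * (at_lower_limit + mid) / 2 + half * (mid + at_upper_limit) / 2

total_hist = histogram_area(FREQS, WIDTH)
total_poly = polygon_area(FREQS, WIDTH)
rect_17_27 = FREQS[1] * WIDTH
poly_17_27 = polygon_area_within_class(FREQS, 1, WIDTH)

print(total_hist, total_poly)    # total areas agree
print(rect_17_27, poly_17_27)    # areas over the $17.50 to $27.50 base differ
```

The totals agree, while the area over the single base $17.50 to $27.50 differs between the two diagrams, which is the distinction drawn in the text.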
This fact leads to the conclusion that the histogram is the more appropriate form to use when it is necessary to represent exactly the number of frequencies in each class, whereas the polygon gives a better picture when a smoothed distribution is wanted.

Uses of Each Type of Graph

For the majority of frequency distributions of economic or social data, that is, for collected data that are in reality samples taken from a larger universe, either type of diagram may be employed. However, because the polygon smooths the contour of a distribution while maintaining the total area, it is suitable only for data that are continuous.

Continuous Data. In the rent distribution suppose that instead of 155 items the sample had been doubled, giving a total of 310 rentals, its representative character, of course, being preserved. If the number of class intervals were then doubled and the width of each interval reduced to five dollars,⁶ the resulting histogram would retain the general shape of Figure 52-A, but due to the operation of the principle of statistical regularity some of the variability of sampling would be removed. Consequently the new histogram would resemble the polygon in Figure 52-B more closely than does the original histogram. If the cases in any such distribution could be multiplied indefinitely, and the class widths decreased accordingly, the final contour would be practically identical whether drawn as a histogram or a polygon. This illustrates the assumption underlying the drawing of a polygon: it interpolates from the sample data the probable intervening values of the universe. Thus it gives a description of the universe derived from the information supplied by a single sample of continuous data.

⁶ Further reduction of the width of the interval would not be possible in this distribution, regardless of the size of the sample, because of the concentration of the actual amounts on the five-dollar values, $35, $40, etc.
This does not mean, however, that each point of the polygon can be read as an actual frequency unless the total frequency is infinite and the width of each class interval infinitesimal. For example, in the rent polygon there might be a natural tendency to think of every point on the polygon as representing a number of frequencies, a tendency to think of 35 rentals, for instance, at $27.50. This is erroneous because in the polygon as in the histogram there are only 35.5 cases between $22.50 and $32.50. It is correct to say that at the $27.50 level rentals occurred at the rate of 35 per $10 interval.

TABLE 62
NUMBER OF EMPLOYEES IN NEW JERSEY LAUNDRIES, NOVEMBER, 1936 *

NO. OF EMPLOYEES            NO. OF LAUNDRIES
  1 and less than  10             16
 10 and less than  25             41
 25 and less than  50             24
 50 and less than 100              7
100 and less than 200              4
200 and less than 300              4
300 and less than ^00              2
Total                             98

* Monthly Labor Review, United States Department of Labor (October, 1937), p. 888.

Discrete Data. In dealing with discrete data, a polygon would incorrectly indicate intervening values that could not possibly exist; consequently the histogram must be used. Each class of a discrete distribution may include only one unit, or several discrete units grouped together. An example of the latter is shown in Table 62, number of laundries employing a specified number of workers. As grouped in this table, each class contains several distinct sizes of laundry. Therefore the columns of a histogram, which indicate that the frequencies in each group are distributed over the entire class, would give a correct representation. However, these data might be further broken down into the number of laundries employing 8, 9, 10, etc., workers, each class having a width of only one natural unit (the individual worker). In this case the discrete nature of the data would be more accurately represented by separate bars erected at the midpoint of each class interval.
The frequencies are all concentrated at these points, and not distributed from 8.5 to 9.5, 9.5 to 10.5, etc.

It should be noted that this is the only situation in which separated bars may be used in a frequency diagram. The adjacent columns of a histogram do resemble bars in a general way, but are of an entirely different nature. The area of the columns of a histogram is important, whereas in a bar diagram only the height is measured. When the class intervals are equal, as in all the illustrations up to this point, the heights of the columns are in the same proportion to one another as their areas, because all of their bases are equal. In the case of unequal class intervals, however, the bases of the columns are unequal, hence the areas are not in direct proportion to their heights. This point will be explained later in the chapter.

TABLE 63
NUMBER OF JUNIOR DRESSES SOLD DURING MONTH OF FEBRUARY, ACCORDING TO SIZE *

SIZES        NUMBER OF DRESSES SOLD
  9                   171
 11                 1,082
 13                 1,676
 15                 1,335
 17                   384
Total               4,868

* Confidential information from a Buffalo department store.

FIGURE 53. FREQUENCY DIAGRAMS OF DISCRETE DATA: NUMBER OF DRESSES SOLD IN JUNIOR SIZES. [A. Separated bars; B. Histogram. Vertical axis: number sold; horizontal axis: sizes 9 to 17.] Data from Table 63.

Another example of a discrete natural unit is shown in Table 63. Sizes of junior dresses are discrete classes, hence the use of separated bars is warranted, as shown in Figure 53-A. However, in order to give the impression of area representing the total number of dresses sold, the use of the histogram, Figure 53-B, is more common. In this case, the odd-numbered size is the natural unit and is just as indivisible as was the individual employee in the preceding example. There are no
junior dress sizes other than those given in Table 63,⁷ so that the number of classes in such a distribution could not be increased nor the class width decreased, no matter how many cases might occur in the sample.

Any business which sells size merchandise such as men's shirts, gloves, shoes, etc., can utilize this type of analysis in controlling its purchases and inventory. From his past records of regular sales (not including year-end or clearance sales) a merchant can prepare frequency distributions and histograms of sizes of merchandise as guides to his next year's purchases or to the maintenance of his regular stocks. The same distributions and graphs would not be useful in other stores or for other neighborhoods, for the distribution of sizes sold by a particular merchant is peculiar to the characteristics of his clientele. The distributions will be different, depending upon age, nationality, economic status, occupations, and other characteristics of the people who purchase in each neighborhood or at specific stores.

Adaptations of Frequency Graphs

The Cumulative Graph: Ogive. For some kinds of analysis and description the cumulative frequency distribution and curve (usually called ogive)⁸ are of more value than the forms just described. A cumulative frequency distribution can be constructed from an ordinary frequency distribution by adding the frequencies of successive class intervals, beginning at the smallest (the largest) class of the distribution, and showing each of these successive totals as the number of cases which is smaller than (greater than) the value of the upper (lower) class limit at that point. Table 64, which contains two types of cumulative distributions constructed from Table 61, shows how they are derived. The frequencies are cumulated from the lower limit to the upper limit of the table in column 2, and from the upper limit to the lower limit in column 3.
Table 65 is the presentation form for the results demonstrated in Table 64. The information obtainable from Table 65 is in more usable form than that provided by Table 61. For instance, without any kind of arithmetic treatment, it is immediately obvious from column 1 that more than half of the families in the sample (87) paid monthly rentals of less than $37.50, and from column 3 that about one-fifth of the group (33) paid rentals of $57.50 per month or over.

⁷ The even numbered sizes are used in a different classification, misses' dresses.

⁸ The name ogive is an architectural term given to the rib of a pointed vault or gothic arch, which has the same shape as this type of curve.

TABLE 64
CUMULATIVE FREQUENCY DISTRIBUTIONS OF MONTHLY RENTALS PAID BY 155 FAMILIES IN COLUMBUS, OHIO

CLASS INTERVAL              (1) FREQUENCY   (2) CUMULATIVE FREQUENCY,   (3) CUMULATIVE FREQUENCY,
                                                LESS THAN UPPER LIMIT       LOWER LIMIT AND ABOVE
$ 7.50 and under $17.50           16                   16                         155
 17.50 and under  27.50           27                   43                         139
 27.50 and under  37.50           44                   87                         112
 37.50 and under  47.50           17                  104                          68
 47.50 and under  57.50           18                  122                          51
 57.50 and under  67.50           11                  133                          33
 67.50 and under  77.50           10                  143                          22
 77.50 and under  87.50            9                  152                          12
 87.50 and under  97.50            3                  155                           3
Total                            155

The curves of these two types of cumulative frequency distributions are shown in Figure 54. Curve L represents the "less than" distribution and curve M the "more than" distribution. It should be noted that in Figure 52-B the midpoints of the class intervals were joined to form the frequency polygon, whereas in the ogives the end values are joined. In the "less than" ogive there are, for example, sixteen cases below $17.50; therefore the frequency 16 is plotted at the upper limit of the $7.50 to $17.50 class. Similarly the next frequency, 43, is plotted at $27.50, etc.

There are two major characteristics of the ogive, the most important of which is the ease of interpolation which its use permits.
In order to determine the number of cases in which less than a given amount, say $40, is paid for rent, it is only necessary to make a vertical ruling from $40 on the horizontal scale to ogive L and then from this point of intersection make a horizontal ruling to the Y axis. This indicates a frequency of approximately 91 families who pay less than $40 per month. The second important characteristic is the slope of the ogive. Where the slope is steepest there is the greatest concentration of frequencies, and wherever it is less steep there are fewer frequencies.

TABLE 65
NUMBER AND PER CENT OF FAMILIES IN COLUMBUS, OHIO, PAYING MORE THAN AND LESS THAN A SPECIFIED MONTHLY RENTAL

                          FAMILIES                                         FAMILIES
RENTALS PAID        (1) Number   (2) Per Cent     RENTALS PAID        (3) Number   (4) Per Cent
Less than $17.50        16          10.3          $ 7.50 and more        155         100.0
Less than  27.50        43          27.7           17.50 and more        139          89.7
Less than  37.50        87          56.1           27.50 and more        112          72.3
Less than  47.50       104          67.1           37.50 and more         68          43.9
Less than  57.50       122          78.7           47.50 and more         51          32.9
Less than  67.50       133          85.8           57.50 and more         33          21.3
Less than  77.50       143          92.3           67.50 and more         22          14.2
Less than  87.50       152          98.1           77.50 and more         12           7.7
Less than  97.50       155         100.0           87.50 and more          3           1.9

FIGURE 54. OGIVES: CUMULATIVE FREQUENCY DIAGRAM OF RENT DATA. [Curve L: less than upper limits; curve M: lower limits and more; median marked at the intersection. Left scale: families (absolute frequencies), 0 to 155; right scale: per cent, 0 to 100; horizontal axis: dollars of rent paid, $7.50 to $97.50.] Data from Table 65.

The cumulative distribution and the ogive are often presented on a percentage basis in practical work. To illustrate this usage, columns 2 and 4 have been included in Table 65. In Figure 54 the scale at the right has been so arranged that 100 per cent is in the same position as 155 on the left-hand scale. The ogives L and M represent respectively either columns 1 and 3 or columns 2 and 4 of Table 65.
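The cumulation shown in Table 64 and the graphic reading of ogive L at $40 can be reproduced arithmetically. The sketch below is an illustration only: it assumes the ogive is a set of straight lines joining the plotted points, so that a ruling on the graph corresponds to linear interpolation, and the names used are hypothetical.

```python
from itertools import accumulate

LIMITS = [7.5, 17.5, 27.5, 37.5, 47.5, 57.5, 67.5, 77.5, 87.5, 97.5]  # class limits, dollars
FREQS = [16, 27, 44, 17, 18, 11, 10, 9, 3]                            # Table 64, column 1

# "Less than upper limit" totals, cumulated from the smallest class upward.
less_than = list(accumulate(FREQS))
# "Lower limit and above" totals, cumulated from the largest class downward.
or_more = list(accumulate(reversed(FREQS)))[::-1]

def families_below(rent):
    """Read ogive L at a given rent by straight-line interpolation
    between the points plotted at the class limits."""
    points = list(zip(LIMITS, [0] + less_than))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= rent <= x1:
            return y0 + (rent - x0) / (x1 - x0) * (y1 - y0)
    raise ValueError("rent lies outside the range of the distribution")

print(less_than)
print(or_more)
print(families_below(40.0))
```

With the Table 64 frequencies the interpolated value at $40 is 91.25, which agrees with the reading of approximately 91 families taken from the graph.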
Information concerning the distribution of rents in the sample can be obtained by reading the left scale as previously indicated. The percentage scale is independent of the actual number of cases in the sample. From it can be read facts concerning the distribution of rents in Columbus, on the assumption that the sample is representative of the entire city. Thus ogive L shows that 25 per cent of Columbus families pay monthly rents of not more than $26.50 and ogive M shows that 25 per cent pay at least $54.00. The point of intersection of the two ogives at a frequency of 50 per cent shows that half of the families pay more and half less than about $35.25 per month.

The ogive has an additional use in the graphic determination of measures of central tendency and dispersion. In particular this diagram will be referred to in chapter XVII.

Histogram of Unequal Classes. In the discussion of the principles for constructing frequency distributions provision was made for unequal class intervals under certain conditions. The graph of such a distribution requires special explanation because the use of the methods previously described would produce definite misrepresentation.

TABLE 66
HOURLY WAGE RATES PAID TO NEWLY HIRED EMPLOYEES BY 52 INDUSTRIAL CONCERNS IN BUFFALO, NEW YORK, IN 1940*

DISTRIBUTION A, EQUAL INTERVALS        DISTRIBUTION B, UNEQUAL INTERVALS
                                                                           (3) Heights of Columns
Hourly Wage Rate     (1) No. of        Hourly Wage Rate     (2) No. of     Adjusted to Preserve Frequency
in Cents             Concerns          in Cents             Concerns       Area in Unequal Intervals
27.5-32.4                 2            27.5-37.4                 4                  2
32.5-37.4                 2            37.5-47.4                12                  6
37.5-42.4                 2            47.5-48.4                 3                 15
42.5-47.4                10            48.5-49.4                 5                 25
47.5-52.4                24            49.5-50.4                 8                 40
52.5-57.4                 8            50.5-51.4                 6                 30
57.5-62.4                 2            51.5-52.4                 2                 10
62.5-67.4                 0            52.5-62.4                10                  5
67.5-72.4                 2            62.5-72.4                 2                  1
Total                    52            Total                    52

* Source confidential.

A tabulation of hourly hiring rates in Buffalo is presented in Table 66. Distribution A is divided into equal five-cent class intervals.
Nearly half of the cases fall in one class and as a result the table does not provide as much information as we should like to have from a frequency table concerning hiring rates. The effect of this concentration is even more evident in the histogram of Figure 55-A. The middle rectangle is so large in relation to the others that the comparison of frequencies by means of the areas of the several rectangles discloses little information beyond the fact, evident from the table, that the hiring rates are concentrated around 50 cents an hour.

In order to learn more about hiring rates, Distribution B was prepared from the original source. Ten-cent intervals were used below 47.5 cents and above 52.5 cents, but the middle class was subdivided into 5 one-cent intervals to provide additional detail concerning the 24 cases in that class. That is, unequal class intervals were introduced as a means of increasing the usefulness of the table. This distribution is plotted in Figure 55-B in the usual way, i.e., with the heights of the rectangles of the histogram representing the frequencies just as written in column 2 of the table. The appearance of the graph is sufficient evidence of its inaccuracy. The difficulty lies in the fact that the areas allotted to the frequencies in the several classes do not correspond to the equivalent areas in Figure 55-A.

The contrast can be followed in a tabulation of the two graphs (Table 67). The areas within the several classes are not the same for A and B nor are the two total areas. The last two columns indicate, however, exactly what should be done to Distribution B in order to obtain a graph whose area will be comparable with the graph of Distribution A. The areas of rectangles on class intervals that have been increased from five cents to ten cents in Figure 55-B have been doubled, and the areas of rectangles on the middle class interval which has been subdivided into 5 one-cent intervals total one-fifth of the former amount. Therefore, dividing the frequencies of the first, second, eighth, and ninth classes by 2 (i.e., multiplying by 5/10), and multiplying each of the intervening classes by 5 (i.e., by 5/1), gives the set of heights of columns adjusted to compare with frequencies of even five-cent intervals.

TABLE 67
AREAS OF CORRESPONDING RECTANGLES OF FIGURE 55
(Frequency × width of class interval)

CLASS INTERVALS    (1) FIGURE 55-A           (2) FIGURE 55-B           (3) FIGURE 55-C
27.5-37.4           2 × 5  =  10              4 × 10 =  40              2 × 10 =  20
                    2 × 5  =  10
                    Subtotal  20
37.5-47.4           2 × 5  =  10             12 × 10 = 120              6 × 10 =  60
                   10 × 5  =  50
                    Subtotal  60
47.5-52.4          24 × 5  = 120              3 × 1  =   3             15 × 1  =  15
                                              5 × 1  =   5             25 × 1  =  25
                                              8 × 1  =   8             40 × 1  =  40
                                              6 × 1  =   6             30 × 1  =  30
                                              2 × 1  =   2             10 × 1  =  10
                                              Subtotal  24              Subtotal 120
52.5-62.4           8 × 5  =  40             10 × 10 = 100              5 × 10 =  50
                    2 × 5  =  10
                    Subtotal  50
62.5-72.4           0 × 5  =   0              2 × 10 =  20              1 × 10 =  10
                    2 × 5  =  10
                    Subtotal  10
Total Area                   260                       304                       260

FIGURE 55
FREQUENCY DIAGRAMS OF HOURLY WAGE RATES PAID BY FIFTY-TWO INDUSTRIAL CONCERNS
[Panel A: equal intervals. Panel B: unequal intervals, incorrect. Panel C: unequal intervals, correct. Vertical scale: number of concerns; horizontal scale: rates in cents.]
Data from Tables 66 and 67.

These adjusted heights, as listed in column 3 of Table 66, are used for the histogram shown in Figure 55-C. The areas of this diagram agree with those in Figure 55-A (shown in Table 67). The first two rectangles of A are identical with the first rectangle of C. The areas of the third and fourth rectangles of A are equivalent to the area of the second rectangle of C. Similar computations show the equivalence of all the corresponding rectangles in the two diagrams. Thus in C the areas of the total and the individual classes respectively have been preserved.
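The adjustment just described amounts to dividing each frequency of Distribution B by the width of its class and multiplying by the five-cent width used in Distribution A. A minimal sketch in Python, offered only as an illustration: the class boundaries 27.5, 37.5, and so on are assumed from Table 66, and the function name `adjusted_heights` is hypothetical.

```python
# Distribution B of Table 66: (lower boundary, upper boundary, number of concerns).
DISTRIBUTION_B = [
    (27.5, 37.5, 4), (37.5, 47.5, 12),
    (47.5, 48.5, 3), (48.5, 49.5, 5), (49.5, 50.5, 8),
    (50.5, 51.5, 6), (51.5, 52.5, 2),
    (52.5, 62.5, 10), (62.5, 72.5, 2),
]
STANDARD_WIDTH = 5.0  # width of the equal intervals of Distribution A, in cents


def adjusted_heights(classes, standard_width):
    """Frequency per standard-width interval for each class (hypothetical helper)."""
    return [freq * standard_width / (upper - lower) for lower, upper, freq in classes]


heights = adjusted_heights(DISTRIBUTION_B, STANDARD_WIDTH)
print(heights)  # [2.0, 6.0, 15.0, 25.0, 40.0, 30.0, 10.0, 5.0, 1.0]

# Area of the corrected histogram: height times class width, summed over classes.
area = sum(h * (upper - lower) for h, (lower, upper, _) in zip(heights, DISTRIBUTION_B))
print(area)  # 260.0, the same as 52 concerns times 5 cents in Figure 55-A
```

The heights match column 3 of Table 66, and the total area matches the 260 shown for Figures 55-A and 55-C in Table 67.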
At the same time additional information has been presented concerning the large number of hiring rates in the middle class without sacrificing any essential facts relative to the rates in the other class intervals.

The general rule for adjusting the frequencies for use in a diagram⁹ or a distribution containing unequal class intervals may be stated as follows: divide the actual frequency of each class by the width of the class to obtain unit frequencies; multiply these unit frequencies by the width of the equal class intervals of another distribution with which they are to be compared. If no comparison is involved the unit frequencies themselves may be plotted or any constant multiple of them.

9 When the class intervals are unequal the transfer from the histogram to the polygon is not justified because in joining the midpoints of adjacent columns the triangular areas added and subtracted are not equivalent.

Comparison of Two Distributions. In applied statistical work as well as in more advanced analysis occasions arise which require that two distributions be represented on the same graph. The purpose is to show the relation between the contours of the two curves and the positions of measures descriptive of the two distributions. Polygons should be used for comparison because the rectangles of histograms would overlap, making clear-cut representation impossible.

In comparing two polygons certain requirements must be met. The class intervals of the two distributions must be the same, and all of the intervals must have the same width. Percentage frequencies must be used to give comparable areas. Two distributions that are to be compared must be brought into conformity with these rules before the graph is planned.
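The conversion to percentage frequencies can be sketched as follows. This is an illustrative Python example rather than the authors' procedure: it assumes the Columbus frequencies of Table 64 and the Buffalo frequencies of Table 68, both on the same nine ten-dollar classes, and the function name `percentage_frequencies` is hypothetical.

```python
# Class frequencies on identical $10 classes from $7.50 to $97.50.
COLUMBUS = [16, 27, 44, 17, 18, 11, 10, 9, 3]   # Table 64, 155 families
BUFFALO = [93, 167, 127, 59, 30, 12, 7, 4, 1]   # Table 68, 500 families


def percentage_frequencies(frequencies):
    """Express each class frequency as a per cent of the total (hypothetical helper)."""
    total = sum(frequencies)
    return [round(100 * f / total, 1) for f in frequencies]


columbus_pct = percentage_frequencies(COLUMBUS)
buffalo_pct = percentage_frequencies(BUFFALO)
print(buffalo_pct)  # [18.6, 33.4, 25.4, 11.8, 6.0, 2.4, 1.4, 0.8, 0.2]

# Share of families in the four classes below $47.50, comparable despite unequal sample sizes.
print(round(100 * sum(COLUMBUS[:4]) / sum(COLUMBUS), 1))  # 67.1
print(round(100 * sum(BUFFALO[:4]) / sum(BUFFALO), 1))    # 89.2
```

Because each set of percentages refers to a total of 100, the two polygons enclose comparable areas even though one sample has 155 families and the other 500.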
TABLE 68
NUMBER AND PERCENTAGE DISTRIBUTION OF 500 FAMILIES IN BUFFALO, NEW YORK, ACCORDING TO VALUE OF MONTHLY RENTALS PAID
(Approximated from reports of real estate dealers)

                                        FAMILIES
MONTHLY RENTAL PAID               Number    Percentage Distribution
$ 7.50 and less than $17.50         93             18.6
 17.50 and less than  27.50        167             33.4
 27.50 and less than  37.50        127             25.4
 37.50 and less than  47.50         59             11.8
 47.50 and less than  57.50         30              6.0
 57.50 and less than  67.50         12              2.4
 67.50 and less than  77.50          7              1.4
 77.50 and less than  87.50          4               .8
 87.50 and less than  97.50          1               .2
Total                              500            100.0

Figure 56 contains two polygons representing comparable material. One is a reproduction of Figure 52-B in per cents, showing the rentals paid in Columbus, and the other is a diagram of the data in Table 68, showing the distribution of rentals in Buffalo. The Buffalo distribution is only approximate because the reports of real estate offices may contain some duplication. In the main, however, the sample is a representative cross-section of rents in Buffalo.

FIGURE 56
PER CENT COMPARISON OF TWO DISTRIBUTIONS OF RENT DATA
[Polygons for Buffalo and Columbus. Vertical scale: per cent of families; horizontal scale: dollars of rent paid.]
Data from Tables 61 and 68.

Comparison of the two polygons shows that the general level of rents is lower in Buffalo than in Columbus. The Buffalo curve is somewhat smoother since it is based on a larger sample. Regardless of the difference in size of sample there is a noticeably greater concentration of rents in Buffalo in the classes below $47.50, and correspondingly a much smaller proportion in the higher rent brackets. The higher rent level in Columbus is presumably an expression of an increase in demand for housing due to expanding personnel of the state government unaccompanied by an equivalent growth in housing facilities.
If this explanation is correct, the difference in conditions is presumably temporary. If other basic causes are present, further investigation might uncover a permanent differential in the rent levels of the two cities.

Types of Curves

One branch of advanced statistics deals solely with the various types of frequency curves and the development of measures used in analyzing them. The subject is introduced here in order to acquaint the student with the graphic appearance of the types of curves most frequently encountered in practical work.

Any description of the different types of curves centers around the "normal" or "bell-shaped" curve. It is a portrayal of the distribution of an infinite number of identically obtained measurements of a fixed object. That is, if an extremely large number of persons, all equally skillful, all possessed of normal vision and all using the same minutely graduated measuring device, were to measure the width of a room, the results would vary above and below the true width of the room as indicated by the normal curve. The curve is, therefore, a picture of the variations due to pure chance. But in practice pure chance is usually mingled with other uncontrolled causes of variation to such an extent that a normal distribution is seldom found outside the laboratory. Yet many of the distributions with which we deal differ so little from the normal curve that its characteristics are transferable, and in addition the normal curve serves as a guide to the description of other types of curves.

In Figure 57 six other curves are presented with the normal curve. The two curves in B are symmetrical like the normal curve but one is flatter and the other more sharply peaked.
The flat-topped curve would result from a distribution in which extraneous factors had produced more variability than would arise from pure chance. The peaked curve would result from a distribution in which extraneous factors tended to offset natural chance variability. C and D are skewed curves depicting distributions in which a controlling factor enters more strongly on one side of the peak than on the other side. These curves have a "long side" and a "short side." Methods of interpreting these are explained under the subject of skewness in chapter XVIII. E and F are less usual but are types that are encountered occasionally.

FIGURE 57
TYPES OF CURVES

Detailed analysis of these curves belongs in more advanced statistical work, but some of the properties of the normal curve and its use in the development of the principles of reliability of samples will be the subject matter of chapters XXVIII and XXIX. Ability to recognize the several types is essential to an understanding of the chapters dealing with averages and dispersion.

The Lorenz Curve

A special type of diagram used to show the nature of the concentration of cases in one or more frequency distributions is known as the Lorenz Curve.¹⁰ The method of preparing data for presentation in a Lorenz Curve can be explained best from an example. Table 69 gives the number of independent retail grocery stores operating in Buffalo in 1929 and 1935 classified according to size as measured by sales. Column 3 is obtained by changing column 2 to per cents and in column 4 these per cents are cumulated. Column 7 is the result of three steps: (a) multiply the midpoint, column 1, of each class by the number of stores, column 2, in the class to obtain the total sales by stores of that size, column 5; (b) express each of these products as a percentage of the total sales of all stores, column 6; (c) cumulate these per cents.
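The steps for columns 3 through 7 can be sketched in Python as below. The figures are hypothetical and are not those of Table 69; the class midpoints, store counts, and the function name `lorenz_points` are all assumptions made only to show the arithmetic.

```python
# Hypothetical size classes, smallest first: (midpoint of sales class in dollars, number of stores).
# These are illustrative values, not the Buffalo data of Table 69.
CLASSES = [(2_500, 40), (7_500, 30), (15_000, 20), (40_000, 10)]


def lorenz_points(classes):
    """Cumulative per cent of stores and of sales for each class (hypothetical helper)."""
    total_stores = sum(stores for _, stores in classes)
    total_sales = sum(midpoint * stores for midpoint, stores in classes)
    points = [(0.0, 0.0)]  # every Lorenz curve starts at the lower left-hand corner
    cum_stores = 0
    cum_sales = 0
    for midpoint, stores in classes:
        cum_stores += stores             # cumulates column 2 (basis of column 4)
        cum_sales += midpoint * stores   # step (a), then cumulated for step (c)
        points.append((100 * cum_stores / total_stores,   # column 4
                       100 * cum_sales / total_sales))    # column 7
    return points


for pct_stores, pct_sales in lorenz_points(CLASSES):
    print(f"{pct_stores:6.1f} per cent of stores, {pct_sales:6.1f} per cent of sales")
```

With classes arranged from smallest to largest, each point lies on or below the diagonal, and the curve ends at 100 per cent on both scales.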
A parallel procedure using the frequencies for 1935 leads to the cumulative per cents of the lower part of the table.

In plotting the points in Figure 58, the cumulative per cents of stores, column 4, are located from the base scale and the cumulative per cents of sales, column 7, from the vertical scale. Each curve therefore begins at the lower left-hand corner of the diagram and ends at the upper right-hand corner.¹¹ If all of the stores had equal sales, then any 10 per cent of the stores would have 10 per cent of the sales volume, any 20 per cent would have 20 per cent of the sales volume and so on, and the plotted points would fall on the diagonal line of the diagram. Hence this diagonal is known as the line of equal distribution. The departure of the actual curves from this line shows the extent of the concentration of sales volume in the larger-sized stores. The greater distance of the 1935 curve from the diagonal line shows the growth of concentration between 1929 and 1935.

The Lorenz Curve is valuable, both in analysis and in presentation, whenever distribution according to two quantitative attributes is of importance. There has been frequent use of this graph during recent years when distribution of business and income have been under discussion.

10 This curve is named after M. O. Lorenz, who developed it and employed it mainly in his studies of wealth. See M. O. Lorenz, "Methods of Measuring the Concentration of Wealth," Journal of the American Statistical Association, New Series No. 70 (June, 1905), pp. 209-19.

11 The base scale is sometimes arranged in reverse order, from right to left, so that the curves will extend from the right of the base scale to the top of the vertical scale.

TABLE 69
[Independent retail grocery stores in Buffalo, 1929 and 1935, classified by size as measured by sales, with columns 1 to 7 as described in the text. Table body not legible in the source.]

FIGURE 58
LORENZ CURVES: CUMULATIVE PER CENTS OF STORES AND SALES, INDEPENDENT RETAIL GROCERY STORES IN BUFFALO, 1929 AND 1935
[Vertical scale: per cent of total sales; horizontal scale: per cent of stores.]
Data from Table 69.

PROBLEMS

1. a) What are the advantages of Table 57, page 352, of the text as compared with Table 56, page 351?
   b) Describe exactly how you would obtain Table 57 from Table 56.
2. What information concerning rentals can be obtained from Figure 49, page 353? from Figure 50, page 354?
3. a) Name the four principles that must be observed in planning a frequency distribution.
   b) State the main points to be considered in applying each principle.
4. Indicate which of the following are correct statements and amend any that are incorrect:
   a) The presence of artificial grouping in an array can be disregarded in preparing a frequency distribution.
   b) All frequency distributions should have at least ten class intervals.
   c) Class intervals can be of equal or unequal width at the convenience of the person preparing a distribution.
   d) Class limits should be established so that the average value of the items included in each interval is approximately equal to the class mark of the interval.
   e) In preparing a distribution of continuous data the only way to designate class limits is by writing the class marks.
   f) The following is a discrete distribution:
        300,000 up to 500,000
        500,000 up to 1,000,000
        etc.
5. State wherein each of the following meets or fails to meet the principles of construction of a frequency distribution.

   A
   INCOME                 AVERAGE MONTHLY RENT
   Under $500                  $25.90
   $ 500 to   700               22.90
     700 to 1,000               22.80
   1,000 to 1,200               26.00
   1,250 to 1,500               28.10
   etc.

   B
   AGE (YEARS)            No. OF PERSONS
   All ages                  6,930,446
   Under 5                     535,600
   Under 1                     100,398
   5 to 9                      577,284
   10 to 14                    575,300
   15 to 24                  1,287,625
   25 to 34                  1,345,984
   etc.

   C
   TYPE OF DWELLING       No. OF FAMILIES PROVIDED FOR
   One-family                    4,620
   Two-family                      159
   Multi-family                  1,195
   Total                         5,974

   D
                          EARNING OVER $4,000
   YEAR OF GRADUATION     Per Cent of Class    Per Cent of Group
   1935                          2                    18
   1936                          5                    26

6. a) Describe the construction of a histogram; a frequency polygon.
   b) What is the relation of the two types of diagram to discrete and continuous data?
7. For what kinds of information is the ogive preferable to the ordinary distribution?
8. a) Under what circumstances is it desirable to use unequal class intervals?
   b) Explain with an example of your own the method of preserving areas in a diagram of a frequency distribution with unequal class intervals.
9. a) What is the reason for using percentage frequencies in comparing two distributions?
   b) In what situation would percentage frequencies be unnecessary for comparing two distributions?
10.
a) Make a frequency table, using the 112 items in the 4 columns assigned to you from the following table. (See numbered assignments at the top of page 386.)
b) Give reasons for your choice of class limits and width of class intervals.
c) Draw a graph showing your frequency distribution.
d) What information concerning wages of semi-skilled female workers in this hosiery mill can be derived from your table and graph?

WEEKLY EARNINGS OF 168 SEMI-SKILLED FEMALE WORKERS, IN HOSIERY MILL XYZ *

  (a)     (b)     (c)     (d)     (e)     (f)
 15.20   18.00   11.20   16.00   20.00   13.60
 11.60   14.00   12.00   11.30   12.20   12.00
  8.00   12.00   17.60   15.60    8.50    8.00
 12.80   12.80    9.50   12.00   14.50   10.00
 14.00   11.80   12.00   10.60   16.00   12.60
  6.40    9.20   14.00   12.00   12.60   14.00
 12.00    7.60   12.00   15.00   12.00    6.50
 12.40   14.80    8.20    6.00    8.00   16.00
 24.00   18.00   28.00    8.00   19.00   14.00
 14.60   16.80   16.80   16.00   22.00   14.60
  9.00   14.20   14.40   17.20   15.20   19.20
 16.50   12.00   21.20   14.40   10.00   12.30
 20.00   12.00   20.00   12.50   14.00   11.60
 18.00   21.00   23.00   20.00   16.00   16.40
 14.10    8.00   14.00   18.80   16.40   16.00
 22.50   16.00   16.10   12.00   12.00   20.00
 12.00   24.00   19.90   12.00   23.80   21.40
 20.80   19.60   12.90    8.40   28.40   24.00
 16.00   27.00   24.00   23.50   17.30   28.80
 18.00   20.00   16.00   20.00   18.00   15.20
  7.20   10.40    8.00   21.60   14.00   25.00
 14.00   15.50   11.80   24.40   11.40   12.00
 26.00   21.80   15.00   14.00   24.50   20.40
 16.00   14.00   16.00   16.20    6.00   17.60
 16.00    6.00   12.40   28.00   20.00    8.80
 12.00   16.00   18.40   16.90   16.00   16.00
 19.40   12.40   15.50   13.00   12.00   18.00
 10.00   16.00    6.00   14.00   13.20   12.00

* Based on similar data appearing in a 1939 issue of the Monthly Labor Review.

Assignments

No.  Columns      No.  Columns      No.  Columns
 1   a b c d       6   a b e f      11   b c d e
 2   a b c e       7   a c d e      12   b c d f
 3   a b c f       8   a c d f      13   b c e f
 4   a b d e       9   a c e f      14   b d e f
 5   a b d f      10   a d e f      15   c d e f

REFERENCES

CHADDOCK, ROBERT E., Principles and Methods of Statistics. Boston: Houghton Mifflin Co., 1925.
Chapter V deserves a careful reading, particularly the location of partition values, pp. 61-^5.

ELDERTON, W.
PALIN, and ELDERTON, ETHEL M., Primer of Statistics. London: A. and C. Black, Ltd., 1920.
Chapters I-IV give an extremely simple statement of the fundamentals of the preparation of frequency distributions and the meaning of measures of central tendency and dispersion.

MUDGETT, BRUCE D., Statistical Tables and Graphs. Boston: Houghton Mifflin Co., 1930.
The explanation of "Graphs of Frequency Distributions" on pp. 102-21 warrants careful reading.

TRELOAR, ALAN E., Elements of Statistical Reasoning. New York: John Wiley and Sons, 1939.
A clear explanation of the difference between discrete and continuous distributions and adjustment for intervals of different width, pp. 26-35.

YULE, G. UDNY, and KENDALL, M. G., Introduction to the Theory of Statistics. London: Charles Griffin and Company, Ltd., 1937.
The basic principles of the construction of frequency distributions and graphs are presented in chapter VI.

CHAPTER XVI

MEASURES OF CENTRAL TENDENCY: AVERAGES OF CALCULATION

INTRODUCTION

THE whole process of statistical analysis is characterized by the attempt to reduce the details of masses of data and to develop summary figures. The initial stages in this analysis have already been pointed out: a statistical table classifies masses of separate items into a small number of comparable groups; a graph is planned to concentrate attention on one or a few major characteristics of a set of data; a ratio involves the substitution of one simple figure for two or more unwieldy ones; and a frequency distribution condenses a long list of separate items into usable form by substituting class values for individual values. The statistician's working equipment must include a knowledge of these various descriptive devices. Another basic tool needed in analysis is the average. An average is frequently described as a "measure of central tendency" because it provides a single summary figure by means of which an entire set of data may be represented.
Measures of central tendency are familiar to statisticians and laymen alike in such examples as average weekly wages, average prices of securities, average daily temperature, a man of average height, a medium-sized house, and the usual amount of rainfall. Familiarity, however, tends to obscure the fact that several different concepts of "average" are involved in these examples and that different methods of computation are employed in obtaining them. It follows, therefore, that several types of summary figures or averages must be explained in developing the subject.

Measures of central tendency fall into two groups: (1) those obtained by calculation, and (2) those defined by position. Each group contains two fundamental averages that have sufficient practical application to warrant explanation in this book. They are,

Averages of Calculation        Averages of Position
Arithmetic Average             Median
Geometric Average              Mode

The remainder of this chapter will be devoted to the arithmetic average and the geometric average. The averages of position and criteria for evaluating the four measures of central tendency will be described in the following chapter.

THE ARITHMETIC AVERAGE

The Average of Ungrouped Data

The measure of central tendency most commonly known and recognized is the arithmetic average, which is frequently called the arithmetic mean or, more simply, the mean. It is calculated by adding together all the items in a group or series, and dividing their sum by the number of items. For instance, the arithmetic average of a student's examination grades is calculated by adding the grades of all the examinations and dividing their sum by the number of examinations. Table 70 illustrates the method.
TABLE 70
CALCULATION OF ARITHMETIC AVERAGE OF EXAMINATION GRADES

First examination          75
Second examination         93
Third examination          87
Fourth examination         88
Fifth examination          93
Total, 5 examinations     436
Average = 436 ÷ 5 = 87.2

There are several characteristics of this simple problem which should be emphasized because they apply to all arithmetic averages. First, all five items are included in the total, which is divided by 5 to obtain the average. Second, a change in any one of the examination grades will affect the value of the average; the average of grades would be increased 5 points by changing the grade of 75 to 100. All students have without doubt made this kind of calculation by assuming different examination grades, before and after an examination. Third, extreme values, either high or low, may produce a value of the average which is not representative of the data. Unusual values have the greatest effect when the average is based on a small number of items.

This method of calculating the arithmetic average can be applied to the sample of rentals paid by families in Columbus as arrayed in Table 57, page 352. The sum of all the 155 monthly rentals is determined to be $6,307. This total divided by 155 gives the arithmetic average, $40.69.¹ If each of the 155 families in the group had paid a monthly rental equal to the arithmetic average, $40.69, the total amount would be the same as was actually paid. The arithmetic average, then, is the value which can be substituted for every actual value in a group without changing the sum of all the values.

The Weighted Average and Weighted Total

There are many cases in which values to be averaged or totaled are of different degrees of importance. When this is the situation, it is necessary to weight the separate items by multiplying each by a factor which represents its relative importance in the group.

Weighted Average.
The problem of weights usually arises in the calculation of course grades. For example, suppose that the fifth grade, 93, in Table 70, was for a final examination, and consequently was twice as important as any other examination grade received. It would be multiplied by 2 and the total divided by 6 (the sum of the weights) as indicated in Table 71. The weighted average is one grade point larger than the unweighted average. In some cases the weighting might cause a much greater difference than one grade point, and it might cause the average to increase (as in this case) or to decrease.

TABLE 71
CALCULATION OF WEIGHTED AVERAGE OF EXAMINATION GRADES

EXAMINATION    GRADE    WEIGHT    GRADE × WEIGHT
First            75        1            75
Second           93        1            93
Third            87        1            87
Fourth           88        1            88
Final            93        2           186
Total                      6           529
Weighted arithmetic average = 529 ÷ 6 = 88.2

1 A formula for the arithmetic average can be developed from this calculation,

\[ \text{Arithmetic average} = \frac{\Sigma X}{N}, \]

where \(\Sigma\) (sigma) stands for "the sum of"; \(X\) stands for any value of the variable, rent; and \(N\) represents the total number of items (that is, rentals). There are several commonly used symbols for the arithmetic average which are employed under different circumstances. In elementary work it is usually represented by its initials, A.A., or by \(M\) (Mean). The latter will be employed in this text. If several averages are being used, a subscript is added to \(M\) to designate the variable of which it is the average: e.g., \(M_x\) for the arithmetic average of the X's, \(M_y\) for the average of the Y's, etc. When used in algebraic manipulation the average is sometimes represented by its formula, or by the symbol \(\bar{X}\).

The weighted average of examination grades alone does not provide a complete basis for a course grade. For the latter, it is necessary to include laboratory work, classroom responses, and special assignments or reports, as well as examinations.
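The calculations of Tables 70 and 71 can be checked with a short Python sketch. It is an illustration only; the function names `arithmetic_average` and `weighted_average` are hypothetical and do not come from the text.

```python
# Examination grades from Table 70; the final examination counts double (Table 71).
GRADES = [75, 93, 87, 88, 93]
WEIGHTS = [1, 1, 1, 1, 2]


def arithmetic_average(values):
    """Sum of the items divided by the number of items."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Sum of value times weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


print(arithmetic_average(GRADES))                    # 87.2
print(round(weighted_average(GRADES, WEIGHTS), 1))   # 88.2

# Raising the grade of 75 to 100 adds 25 points to the total, or 5 points to the average.
raised = [100, 93, 87, 88, 93]
print(round(arithmetic_average(raised) - arithmetic_average(GRADES), 1))  # 5.0
```

With every weight equal to 1 the weighted average reduces to the simple arithmetic average, which is why the unweighted figure can be regarded as a special case.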
To average the grades of these diverse elements, each of which is of different importance, it is again necessary to employ weights. Since each type of grade included might be a weighted average, like the examination grade above, the final course grade becomes a weighted average of weighted averages.

A weighted average requires careful consideration of the items being averaged, in order to arrive at an equitable basis for establishing the weights. In the case of a course grade in statistics, after scrutinizing all elements involved, it may have been decided that they should have the measures of importance shown in column 1, Table 72. The per cents which represent the importance of each course element in the final grade may be called weights. The weighted average is obtained by (1) multiplying the value of each item by its weight, a measure of its importance in the total; (2) dividing the sum of these products by the sum of the weights. The method of calculating a weighted average to obtain a course grade for a student in elementary statistics is shown in Table 72.

TABLE 72
CALCULATION OF WEIGHTED AVERAGE OF ELEMENTS OF COURSE TO OBTAIN FINAL COURSE GRADE

                        (1) WEIGHT    (2) AVERAGE    (3) AVERAGE GRADE
ELEMENT                 (per cent)       GRADE           × WEIGHT
Examinations                60             88              5,280
Laboratory work             20             80              1,600
Classroom response          10             90                900
Homework                    10             70                700
Total                      100                             8,480
Weighted average * = 8,480 ÷ 100 = 84.8

* If the letter W stands for the weight and n stands for the number of items, the process can be described algebraically as:

\[ \text{Weighted average} = \frac{X_1 W_1 + X_2 W_2 + \cdots + X_n W_n}{W_1 + W_2 + \cdots + W_n} \]

In summary form:

\[ \text{Weighted average} = \frac{\Sigma XW}{\Sigma W} \]

The weighted average, 84.8, is greater than the simple arithmetic average of 82.0 (the sum of the grades in column 2 divided by 4) by more than 2 points. This increase over the simple average is due to the increased importance which is given to the high average examination grade of 88 through the process of weighting.
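The course grade of Table 72 can be reproduced in Python as a rough check. The sketch below is illustrative; the dictionary layout and the function name `weighted_average` are assumptions, and the second set of weights (6, 2, 1, 1) is simply the per cents divided by 10.

```python
# Course elements from Table 72: name -> (weight in per cent, average grade).
ELEMENTS = {
    "Examinations": (60, 88),
    "Laboratory work": (20, 80),
    "Classroom response": (10, 90),
    "Homework": (10, 70),
}


def weighted_average(pairs):
    """Sum of grade times weight divided by the sum of the weights."""
    pairs = list(pairs)
    return sum(w * g for w, g in pairs) / sum(w for w, _ in pairs)


course_grade = weighted_average(ELEMENTS.values())
simple_average = sum(g for _, g in ELEMENTS.values()) / len(ELEMENTS)
print(course_grade)     # 84.8
print(simple_average)   # 82.0

# Weights proportional to the per cents (6, 2, 1, 1) lead to the same course grade.
proportional = [(w // 10, g) for w, g in ELEMENTS.values()]
print(weighted_average(proportional))  # 84.8
```

Only the relative sizes of the weights matter, since any common factor appears in both the numerator and the denominator.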
The weights of the several parts of the statistics course are shown in Table 72 as per cents, the sum of which is 100, representing the whole course. From elementary arithmetic it will be remembered that the numerator and denominator of a fraction can be multiplied or divided by the same number without changing the value of the fraction. Consequently, the absolute values of the weights can be replaced by relative values proportional to the absolute values. The relatives are easier to use and give the same result.

In a study of wages of farm labor in Vermont for the period 1780-1937, a weighted average is used to calculate the annual average of day wage rates. "The annual wage rates of labor hired by the day are weighted averages of the monthly data."² The weights assigned to the average daily wage rates of the several months are:

January      5      May       8      September    8
February     4      June     10      October      8
March        5      July     18      November     6
April        7      August   16      December     5
                                     Total      100

Examination of these weights reveals that they are closely related both to the seasons and to the number of days in each month. The day wages in July, for instance, are of most importance because of the great amounts of farm labor hired in that month, the usual clemency of the weather, and the critical period for growing crops, as well as the number of working days in the month. The day rates in July, then, should have the greatest influence in the determination of the annual average. Conversely, February rates should have the least influence.

2 T. M. Adams, Prices Paid by Farmers for Goods and Services and Received by Them for Farm Products, 1790-1871; Wages of Farm Labor, 1780-1937 (A Preliminary Report from University of Vermont and State Agricultural College, Burlington, Vermont, February, 1939), pp. 43-44.

Effect of Weights. One might justifiably ask, "What are the effects of weighting on the simple average?" The answer may be generalized as follows:

1. If the larger weights are applied to the smaller values, and the smaller weights to the larger values, the influence of the smaller values will be increased, and the value of the weighted average will be smaller than the value of the unweighted average.

2. If the larger weights are associated with the larger values and the smaller weights with the smaller values, the influence of the larger values will be increased and the resulting average will be larger than the unweighted average.

3. If there is no relation between the sizes of the weights and the values of the items, that is, if large weights are as frequently assigned to small items as to large items and small weights are similarly distributed, the difference between the weighted average and the unweighted average is likely to be very small and entirely due to chance; in fact under these circumstances the two averages may be the same.³

The computation of a weighted average can be explained from Table 73. The per capita sales in column 4 are the results of dividing each sales figure of column 3 by the corresponding population in column 1. The average per capita sales for the entire table is not obtained by averaging the figures in column 4 because the populations represented by the cities in the several size groups are different. The proper method is to multiply each per capita figure by the population of that group, sum the products, and divide by the total population. The result of this operation is $97.79, as indicated.
The same figure can be obtained by using as weights the percentage distribution of the population in column 2, and dividing the sum of the products by 100. The average per capita sales can also be computed by dividing the total sales by the total population, i.e., $9,028,700,000 ÷ 92,330,000 = $97.79.

TABLE 73
SALES OF RETAIL FOOD STORES (1935) AND POPULATION (1930) IN CITIES OF DIFFERENT SIZE *

| Size of city | (1) Population: number (in thousands) | (2) Population: percentage distribution | (3) Retail food sales (in millions) | (4) Per capita sales |
|---|---|---|---|---|
| 250,000 and over | 28,785 | 31.2 | $2,839.3 | $98.64 |
| 75,000-250,000 | 9,771 | 10.6 | 941.5 | 96.36 |
| 10,000-75,000 | 19,784 | 21.4 | 1,998.5 | 101.02 |
| 2,500-10,000 | 10,615 | 11.5 | 1,175.2 | 110.71 |
| Under 2,500 (excluding farms) | 23,375 | 25.3 | 2,074.2 | 88.74 |
| Total | 92,330 | 100.0 | $9,028.7 | $97.79 |

* Population, United States Census, 1930; food store sales, United States Census of Business, 1935.

3 Cf. E. C. Rhodes, Elementary Statistical Methods (London: George Routledge & Sons, Ltd., 1933), pp. 143-45.

When all of the necessary information is available as in this table, the weighted average should be computed as the ratio of the two pertinent totals. In most cases only the individual ratios will be given. Then weights must be found in order to compute the weighted average. The rule for determining these weights was stated in chapter XI in discussing the problem of averaging ratios. 4 That development shows that in averaging the per capita figures the denominators from the computation of the individual per capita figures must be used as weights. But the average per capita sales in each size of city multiplied by the population in that group gives the total sales in the group and the sum of these products is the total sales. Therefore, the two methods just described for computing the weighted average are identical and one or the other should be used according to the form in which the data are available.
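The equivalence of these methods can be checked with a short sketch. The figures are transcribed from Table 73; the variable and function names are illustrative choices, and the percentage weights are recomputed without rounding rather than copied from column 2.

```python
# Figures transcribed from Table 73
# (population in thousands, retail food sales in millions of dollars).
population = [28_785, 9_771, 19_784, 10_615, 23_375]
sales = [2_839.3, 941.5, 1_998.5, 1_175.2, 2_074.2]

# Column 4: per capita sales in dollars.
# Millions of dollars per thousand persons, times 1,000, gives dollars per person.
per_capita = [s * 1_000 / p for s, p in zip(sales, population)]


def weighted_average(values, weights):
    """Sum of value-times-weight products divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Method 1: weight each per capita figure by the population of its group.
by_population = weighted_average(per_capita, population)

# Method 2: weight by the percentage distribution of the population
# (recomputed here without rounding; an assumption made for exactness).
percentages = [100 * p / sum(population) for p in population]
by_percentage = weighted_average(per_capita, percentages)

# Method 3: ratio of the two pertinent totals.
ratio_of_totals = sum(sales) * 1_000 / sum(population)

# Simple (unweighted) average of column 4, for comparison.
unweighted = sum(per_capita) / len(per_capita)

print(f"weighted by population:   {by_population:.2f}")
print(f"weighted by percentages:  {by_percentage:.2f}")
print(f"ratio of totals:          {ratio_of_totals:.2f}")
print(f"unweighted average:       {unweighted:.2f}")
```

All three weighted results agree because relative weights proportional to the absolute weights leave the average unchanged, and because each per capita figure times its own denominator restores the group's sales.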
The use of the population column for weighting is required in order to retain in the weighted average a characteristic of the simple average, i.e., that the average times the number of items will give the total value of the original items. In the weighted average that rule becomes: the weighted average times the total of the absolute weights must equal the total value of the original items. In the example the average per capita sales times the total population equals the total sales. This characteristic will be referred to as the total value criterion.

The unweighted average of column 4 is $99.10. The effect of the weights, therefore, has been to reduce the value of the average. This means that large weights tend to be associated with small per capita sales and small weights with large per capita sales. Survey of the table shows that, although the relation is somewhat mixed, the next to the largest weight is attached to the smallest per capita figure and the next to the smallest weight is attached to the largest per capita figure.

Another use of the weighted arithmetic average is found in the computation of the percentage changes in retail sales of independent dealers in Ohio. In this case the purpose is to derive from sample data an average figure that will represent the universe. Regular reports of current retail sales are collected monthly from a large number of co-operating merchants representing many lines of retail trade. By classifying and tabulating these reports according to the lines of trade which they represent, it is possible to calculate the percentage changes from the previous month in retail sales for all independent dealers in Ohio, as shown in Table 74.

4 See Chapter XI, pp. 265-49.
TABLE 74
CALCULATION OF PERCENTAGE CHANGE IN RETAIL SALES BY INDEPENDENT DEALERS IN OHIO, USING WEIGHTED AVERAGE OF RELATIVES DERIVED FROM SAMPLE REPORTS REPRESENTING VARIOUS LINES OF TRADE, SEPTEMBER, 1939, DIVIDED BY AUGUST, 1939

| Line of trade | (1) Percentage relatives, Sept. 1939 ÷ Aug. 1939 * (X) | (2) Sales of each line as relative of total net sales, 1935 Census (W) | (3) Weighted percentage relatives, (1) × (2) (XW) |
|---|---|---|---|
| Grocery without meats | 116.89 | .0384 | 4.49 |
| Grocery with meats | 110.00 | .1620 | 17.82 |
| Country general | 102.88 | .0264 | 2.72 |
| Department stores | 125.45 | .1367 | 17.15 |
| Men's and boys' clothing | 126.67 | .0277 | 3.51 |
| Family clothing | 106.88 | .0112 | 1.20 |
| Women's ready-to-wear | 123.54 | .0260 | 3.21 |
| Shoes | 147.16 | .0125 | 1.84 |
| Motor vehicles | 86.73 | .2031 | 17.62 |
| Gasoline filling stations | 98.95 | .0971 | 9.61 |
| Furniture stores | 100.51 | .0380 | 3.82 |
| Household appliances | 122.43 | .0123 | 1.51 |
| Radio stores | 108.30 | .0026 | .28 |
| Lumber and building material | 95.71 | .0297 | 2.84 |
| Heating and plumbing | 119.09 | .0050 | .60 |
| Hardware | 107.16 | .0376 | 4.03 |
| Restaurants | 105.47 | .0784 | 8.27 |
| Drugs | 100.32 | .0381 | 3.82 |
| Florists | 111.29 | .0055 | .61 |
| Jewelry | 108.54 | .0118 | 1.28 |
| Total | | 1.000 | 106.23 |

Weighted percentage change: +6.23
Unweighted percentage change †: +15.18

* Unpublished data.
† The unweighted percentage change is obtained by using the unweighted totals of reports from all lines of trade.

The use of a weighting system with a total of 1.0 is explained in chapter XII. It would be an easy matter to add the sales reported in all lines for the respective periods. The percentage change would then be the ratio between the two totals less 100 per cent. But this method, though simple, involves a basic error. The total sales of the reporting stores in each line of trade do not bear the same proportionate relationship to the total of all sales reported as the total value of these same lines of trade holds to the actual total value of all retail trade.
This disproportionate relationship is due to the voluntary co-operative arrangement by which the collection process is carried on. In order to overcome this difficulty weights have been introduced. The percentage relatives, column 1, calculated from the total of the reported sales of each line of trade, are multiplied by a number, column 2, representing the proportion which sales by independent dealers in each of the lines of retail trade constituted of the total of retail sales by independent dealers in Ohio as reported in the Census of Business of 1935. 5 The sum of these products, column 3, becomes the basis for calculating the weighted percentage change. The effect of weighting is very important here, for the weighted change shows an increase of 6.23 per cent, whereas the unweighted change shows an increase of 15.18 per cent.

In this example the weights are not the denominators of the ratios to which the weights are applied. As a result, the weighted average times the sum of the absolute weights (the total sales upon which the relatives in column 2 are based) will not give the total value (in this case total sales reported in September, 1939). This departure from the established criterion is justified by the purpose of the computation, which is to estimate from the reported sample the percentage relation in a universe. The figure 106.23 is the best obtainable estimate of the percentage relation between the sales in September and August of all independent retail outlets in Ohio in the included lines of trade. That is, if it were possible to know the sales of all such stores in the state in August, the result of multiplying that figure by 1.0623 would approach the actual sales of these stores in September. This extension of the weighted average is frequently involved in estimating the conditions of a universe from those found in a sample.

Choice of Weights.
The problem of selecting weights does not arise so long as the total value criterion is adhered to. Two types of cases must be considered in which the criterion is abandoned. In both of these the question of choice of weights is a necessary preliminary step to the computation of the average. The principles involved in the choice are discussed in general terms here. A more specific discussion of the weighting problem in the construction of index numbers is presented in chapter XIX.

The first type of case arises when the conditions of a universe are to be inferred from a sample. The computation of the percentage change in retail trade in Ohio presented in the preceding section is an example. The weights in this case represented the relative importance of the several lines of trade and were applied to percentage relations in the sample. This is a standard method of weighting to transfer from sample to universe, but by no means a universal one. Data in different form may require a different method of weighting. Therefore, no rule can be offered except the broad one that the weights must be so selected that the characteristics of the universe will be correctly inferred.

5 It should be observed that the weights in this case represent the fractions which the annual volumes of the separate lines constituted of the total annual volume in 1935. To the extent that the different retail lines are differently affected by seasonal influences, these weights introduce errors. It is felt, however, that the error thus introduced is less than that due to lack of representative sampling.

The second type of case appears when the true weights required to preserve the total value criterion are unavailable but some systematic weighting is an obvious necessity. The choice lies between approximating the missing true weights and adopting an arbitrary weighting system. The first alternative is employed in computing the average farm price of wheat at a particular time. A large number of individual prices are reported from various parts of the country but no corresponding reports of the amounts sold at various prices are available. Skillful estimators supply the missing quantity weights on the basis of whatever auxiliary knowledge can be gleaned from sources at their command. The result is a weighted average farm price closely approximating the correct average price. There is another reason for the success of this procedure in the hands of practiced estimators. Small variations in the weighting system will have comparatively little effect on the value of the average. For this reason Bowley stated as a general rule, "In calculating averages give all care to making the items free from bias, and do not strain after exactness of weighting." 6

The second alternative, arbitrary weighting, is employed whenever an approximation to the true weighting system is not feasible. An example is found in the weights suggested on page 391 for determining an average monthly wage rate for farm labor. The weights attached to the several months are a composite of several elements and, since the total value criterion has been abandoned, the accuracy of the weighted result is largely dependent upon the judgment of the one who established the weighting system. The full meaning of judgment weighting is brought out in chapter XXV in connection with the construction of an index of business conditions from component series of unequal importance.

6 Arthur L. Bowley, Elements of Statistics (London: P. S. King and Son, Ltd., 1920, fourth edition), p. 94.

Weighted Total. In some cases the weighted average is not as useful as the weighted total, i.e., Σ(WX). For the most part this applies when some standard unit of measurement exists independent of the particular investigation, and comparisons must be made in terms of that unit.
For example, in calculating the cost of living of families for use in the administration of unemployment relief, the problem of family size arises at once. The number of persons per family is not a sufficiently accurate standard. Families of five persons, for instance, are not all equivalent, for the ages and sexes of the members of a family are very important in determining food requirements as well as in estimating clothing needs and housing costs. One approach to the solution of the problem of calculating food requirements, which was presented by a group of experts under the auspices of the League of Nations, involved assigning weights to persons according to their age and sex. A value of 1 was given to a male between 14 and 59 years of age and all other ages were assigned values relative to this. The scale of weights which was developed follows:

| Age and sex | Weight |
|---|---|
| Under 2 years, male or female | .2 |
| 2-3 years, male or female | .3 |
| 4-5 years, male or female | .4 |
| 6-7 years, male or female | .5 |
| 8-9 years, male or female | .6 |
| 10-11 years, male or female | .7 |
| 12-13 years, male or female | .8 |
| 14-59 years, male | 1.0 |
| 14-59 years, female | .8 |
| 60 years and over, male or female | .8 |

In order to obtain the weighted total of food units required for a family, it is necessary to assign the proper weight to the number of family members according to age and sex, and add the products. For instance, the food units required for a family of five members, father aged 40, mother aged 35, one son aged 15, and two daughters, one aged 10 and one 11, are calculated as follows:

| Member | Age | Number | Weight | Total units |
|---|---|---|---|---|
| Father | 40 | 1 male | 1.0 | 1.0 |
| Mother | 35 | 1 female | .8 | .8 |
| Son | 15 | 1 male | 1.0 | 1.0 |
| Daughters | 10 and 11 | 2 females | .7 | 1.4 |
| Total | | 5 members | | 4.2 |

The weighted total of 4.2 represents the total number of food units required for the family in question. This is an average of .84 food units per person for this family.
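The food-unit computation above can be sketched as a small lookup. The function and variable names are illustrative only, and the sketch assumes that ages are given in completed years so that the class boundaries of the scale apply without ambiguity.

```python
# Upper age bound (exclusive) and weight for the classes that ignore sex.
# Assumption: ages are given in completed years.
CHILD_SCALE = [(2, 0.2), (4, 0.3), (6, 0.4), (8, 0.5),
               (10, 0.6), (12, 0.7), (14, 0.8)]


def food_unit_weight(age, sex):
    """Return the weight from the scale quoted above; sex is 'male' or 'female'."""
    for upper_bound, weight in CHILD_SCALE:
        if age < upper_bound:
            return weight
    if age < 60:
        # 14-59 years: the only classes in which the sexes are weighted differently.
        return 1.0 if sex == "male" else 0.8
    return 0.8  # 60 years and over, male or female


# The family of the worked example: father, mother, son, two daughters.
family = [(40, "male"), (35, "female"), (15, "male"), (10, "female"), (11, "female")]

weighted_total = sum(food_unit_weight(age, sex) for age, sex in family)
average_per_person = weighted_total / len(family)

print(f"food units for the family: {weighted_total:.1f}")
print(f"food units per person:     {average_per_person:.2f}")
```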
Since the natural unit for relief is the family and not the individual, the weighted total is in this case a more useful figure than the weighted average. It should be observed that this system of weighting did not differentiate food requirements of the different sexes under 14 years of age or 60 years and over.

The Average of a Frequency Distribution

Frequency distributions are so commonly used in analyzing and describing various kinds of business data that it is necessary to examine the methods by which an arithmetic average can be calculated from such a distribution. The method is very similar to that employed in obtaining a weighted average. Each midpoint of a class interval is multiplied by the frequency of that interval. The sum of the products of midpoints multiplied by the frequencies represents the total value of all items in the distribution. When this total is divided by the sum of the frequencies, the result is the arithmetic average of the distribution.

Frequency distributions should be constructed so that the class marks represent all the values included in a class interval. Although this standard for construction may not always be attained, the class limits should be so established that in each class the midpoint of the class interval is approximately equal to the average of the actual values of the items. Each midpoint will therefore represent all the values in the class. 7

TABLE 75
CALCULATION OF THE ARITHMETIC AVERAGE FROM THE FREQUENCY DISTRIBUTION OF RENTALS PAID BY 155 FAMILIES IN COLUMBUS, OHIO

| Class interval (dollars) | (1) Class midpoint X | (2) Frequency f | (3) Frequency × midpoint fX |
|---|---|---|---|
| 7.50 and under 17.50 | 12.50 | 16 | 200.00 |
| 17.50 and under 27.50 | 22.50 | 27 | 607.50 |
| 27.50 and under 37.50 | 32.50 | 44 | 1,430.00 |
| 37.50 and under 47.50 | 42.50 | 17 | 722.50 |
| 47.50 and under 57.50 | 52.50 | 18 | 945.00 |
| 57.50 and under 67.50 | 62.50 | 11 | 687.50 |
| 67.50 and under 77.50 | 72.50 | 10 | 725.00 |
| 77.50 and under 87.50 | 82.50 | 9 | 742.50 |
| 87.50 and under 97.50 | 92.50 | 3 | 277.50 |
| Total | | 155 | 6,337.50 |

$$M = \frac{6{,}337.50}{155} = \$40.89$$

7 See chapter XV for a complete discussion of the characteristics of frequency distributions which will affect the calculation of arithmetic averages.

Direct Calculation. The arithmetic average of the rent distribution which was constructed according to this principle is calculated in Table 75 to illustrate the method. Since $12.50, the midpoint of the first class interval, $7.50 to $17.50, is assumed to represent the average rental paid by the 16 families whose rentals fall within the interval, the total amount paid by the 16 families should be $12.50 multiplied by 16, or $200.00. Likewise, each product in column 3 represents the total rental paid by the families in that class interval of rentals. The total of column 3, $6,337.50, should therefore be approximately equal to the total amount paid for rent by all families in the sample. The arithmetic average is found to be $40.89 by dividing this total by 155, the number of families. 8

It will be observed at once that $40.89 differs slightly from the figure that was obtained by the computation of the arithmetic average of the original data on page 389, and the computed total rentals paid, $6,337.50, is likewise a different total. A question arises at once as to which average or total ought to be used. Obviously the computations from the original data are more precise, but whether they should therefore be used will depend upon the purpose. The availability of the original data may also be a determining factor. On many occasions data are available only in frequency distributions so that there is no question as to which method of calculation to employ.

If the average and total were being used by a rental office in connection with its accounting records, the values of each separate rental would be on file and the simple arithmetic total and average would probably be computed from them.
On the other hand, if the average of this sample is to be used to represent the average rent paid by all families in the city, calculation from the frequency distribution is preferable. 9 The frequency distribution is a device for summarizing data and for reducing the amount of work involved in calculating statistical measures. When, as in the case of the rentals, a relatively small number of items is included in the distribution, differences may occur in the measures of central tendency calculated from the grouped and ungrouped data. As the size of the sample increases, however, differences in these calculated values tend to disappear.

8 It is frequently helpful to be able to describe the calculation of the average from the frequency distribution by a formula. Such a formula can easily be developed from the computation just completed. The procedure was as follows: Arithmetic average = (frequency of first class interval × midpoint of first class interval) + (frequency of second class interval × midpoint of second class interval) + etc., divided by the sum of the frequencies. Using the symbols at the head of each column in Table 75, the formula for the arithmetic average becomes:

$$M = \frac{\Sigma(fX)}{\Sigma f}$$

X is used to denote the values of the midpoints of class intervals in a frequency distribution just as X would be used to denote the individual values if the data were ungrouped. If a student understands the procedure in the calculation of the arithmetic mean, he need not memorize this formula. Note, however, that the denominator is always the total number of frequencies, not the number of classes in the distribution.

Short-Cut Calculation. The direct method of computing the arithmetic average from a frequency distribution is not an involved process, but the actual steps of multiplication, particularly in large distributions, may become a real burden.
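As a point of reference for the short-cut methods, the direct method can be sketched in a few lines. The midpoints and frequencies are transcribed from Table 75; the function name is an illustrative choice.

```python
# Class midpoints (dollars) and frequencies transcribed from Table 75.
midpoints = [12.50, 22.50, 32.50, 42.50, 52.50, 62.50, 72.50, 82.50, 92.50]
frequencies = [16, 27, 44, 17, 18, 11, 10, 9, 3]


def grouped_mean(midpoints, frequencies):
    """Direct method: sum of frequency-times-midpoint products over the sum of frequencies."""
    total_value = sum(f * x for x, f in zip(midpoints, frequencies))
    return total_value / sum(frequencies)


total_rent = sum(f * x for x, f in zip(midpoints, frequencies))
mean_rent = grouped_mean(midpoints, frequencies)

print(f"number of families:    {sum(frequencies)}")
print(f"estimated total rent:  {total_rent:,.2f}")
print(f"arithmetic average:    {mean_rent:.2f}")
```

Note that the divisor is the total of the frequencies (155), not the number of classes (9).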
The number of computations can be reduced by the use of short-cut methods and as a consequence the chance for errors will be decreased. Although at a first glance, or upon an initial trial, these methods may not appear shorter, a little practice will convince anyone except an arithmetic wizard that much time and labor can be saved by employing them. They also lay the foundation for a much greater saving in more advanced calculations.

Method 1: The arithmetic average of rentals from the sample of 155 families is calculated by short-cut method 1 in Table 76. 10 In carrying out the illustrative computations the first step is to select one of the midpoints 11 as an assumed average. Before calculation, of course, the average is not known, but for illustrative computation A the midpoint $42.50 is chosen as the assumed average.

9 See page 368 for discussion of the use of frequency distributions of sample data for representing the characteristics of a larger universe.

10 This text employs a fixed notation, the basis of which is as follows:

1. Capital letters (X, Y) are used to denote values of variables measured from zero, e.g., the midpoints of class intervals in column 1, Table 76.
2. Small letters (x, y) are used to denote values of variables measured from the average, e.g., the differences of the midpoints of the rent classes from the average rent, $40.89.
3. The letter d is used to denote values of variables measured from an assumed average, e.g., columns 3 and 5, Table 76.
4. Primes and subscript letters will be used to distinguish variables that are being compared, e.g., d and d' in columns 3 and 5 to indicate deviations from two different assumed averages, and d_s to indicate deviations in steps in Table 77, column 3.
5. The measures of central tendency will be designated as follows: M = true arithmetic average; M' = assumed arithmetic average; Me = median; Mo = mode; G.M. = geometric average.
6. Additions to the notation will be made in subsequent chapters as the need arises.

11 Any value can be used as the assumed average, but it has become customary to use the midpoint of a class interval because it affords easiest computation.

TABLE 76
SHORT-CUT METHOD 1 FOR COMPUTING THE ARITHMETIC AVERAGE FROM THE FREQUENCY DISTRIBUTION OF RENTALS PAID BY 155 FAMILIES IN COLUMBUS, OHIO

Columns 3 and 4 belong to illustrative computation A (assumed average of $42.50); columns 5 and 6 belong to illustrative computation B (assumed average of $22.50).

| Class interval (dollars) | (1) Class midpoint X | (2) Frequency f | (3) Dollar deviation of midpoint from assumed average of $42.50, d | (4) Frequency × deviation, (2) × (3), fd | (5) Dollar deviation of midpoint from assumed average of $22.50, d' | (6) Frequency × deviation, (2) × (5), fd' |
|---|---|---|---|---|---|---|
| 7.50 and under 17.50 | 12.50 | 16 | −30 | −480 | −10 | −160 |
| 17.50 and under 27.50 | 22.50 | 27 | −20 | −540 | 0 | 0 |
| 27.50 and under 37.50 | 32.50 | 44 | −10 | −440 | +10 | +440 |
| 37.50 and under 47.50 | 42.50 | 17 | 0 | 0 | +20 | +340 |
| 47.50 and under 57.50 | 52.50 | 18 | +10 | +180 | +30 | +540 |
| 57.50 and under 67.50 | 62.50 | 11 | +20 | +220 | +40 | +440 |
| 67.50 and under 77.50 | 72.50 | 10 | +30 | +300 | +50 | +500 |
| 77.50 and under 87.50 | 82.50 | 9 | +40 | +360 | +60 | +540 |
| 87.50 and under 97.50 | 92.50 | 3 | +50 | +150 | +70 | +210 |
| Total | | 155 | | −250 | | +2,850 |

Illustrative Computation A: M' = 42.50

$$M = 42.50 + (-250 \div 155) = 42.50 - (250 \div 155) = 42.50 - 1.61 = \$40.89$$

Illustrative Computation B: M' = 22.50

$$M = 22.50 + (2{,}850 \div 155) = 22.50 + 18.39 = \$40.89$$

The interval for which the midpoint is $32.50 is $10 less than the assumed average, and so is shown in column 3 of Table 76 as deviating from the assumed average by −$10; the midpoint $22.50 is shown as $20 less than $42.50; the midpoint at $52.50 is shown as $10 more than the assumed average, etc. These differences are called actual dollar deviations of the midpoints from the assumed average, and are shown in order in column 3. The deviations are multiplied by their respective frequencies, and the products, retaining the algebraic signs of the deviations, are shown in column 4.
These products are the amounts of difference between the total rentals actually paid in each class and the rentals that would have been paid if everyone had paid a rental equal to the assumed average. For instance, the 16 families in the first class interval actually paid $480 less than they would have paid if each of the 16 families had paid $42.50 per month in rent. The net total of column 4 indicates that the whole group of 155 families actually paid $250 less in rent than they would have paid if everyone had paid $42.50, the amount of the assumed average. With this information at hand, the arithmetic average can be computed as follows:

Arithmetic average = assumed average + the net difference divided equally among all the items included (prorated net difference), i.e.,

$$M = 42.50 + (-250 \div 155) = 42.50 - 1.61 = \$40.89$$

This is the same average that was obtained by the direct method in Table 75. Illustrative computation B, at the right of Table 76, columns 5 and 6, shows that the assumed average can be taken at a different midpoint with identical results. Following the same procedure as was employed in the direct method, a formula for short-cut method 1 can be developed:

$$M = M' + \frac{\Sigma(fd)}{\Sigma f}$$

Method 2: This method is a modification of method 1. The same deviations in actual amounts are used, but instead of being taken as actual values they are counted as equal "steps" of deviation from the assumed average. For instance, in this case each step is defined as equal to $10. The calculation is then made as in Table 77. The width of the class interval is conveniently chosen as the step in distributions having equal class intervals. The midpoint $42.50 is again chosen as the assumed average. Each $10 of deviation of a class interval midpoint from the assumed average is considered as one step, as in column 3. The multiplication of frequencies by these step deviations is done just as it was in the former illustration.
For instance, the sum of the products of the frequencies and step deviations, column 4, indicates that the 155 families taken all together paid the equivalent of 25 steps (each $10 wide) less in rent than they would have paid if each rental had been equal to the assumed average. The value of these 25 steps must now be prorated among the 155 rentals paid and reduced to dollar figures. The calculation is as shown below Table 77. The formula for this method can be written as:

$$M = M' + \left(\frac{\Sigma(f d_s)}{\Sigma f}\right) i$$

in which i = the width of the step (usually the class interval) expressed in the original units. The chief advantage of this method over short-cut method 1 is that the multiplications are so reduced in size that they can usually be performed mentally. In computing the average, the final multiplication of the prorated net difference by the width of the step must never be overlooked.

TABLE 77
SHORT-CUT METHOD 2 FOR COMPUTING THE ARITHMETIC AVERAGE BY STEP DEVIATIONS FROM THE FREQUENCY DISTRIBUTION OF RENTALS PAID BY 155 FAMILIES IN COLUMBUS, OHIO

| Class interval (dollars) | (1) Class midpoint X | (2) Frequency f | (3) Deviation in steps from assumed average d_s | (4) Frequency × deviation in steps, (2) × (3), f d_s |
|---|---|---|---|---|
| 7.50 and under 17.50 | 12.50 | 16 | −3 | −48 |
| 17.50 and under 27.50 | 22.50 | 27 | −2 | −54 |
| 27.50 and under 37.50 | 32.50 | 44 | −1 | −44 |
| 37.50 and under 47.50 | 42.50 | 17 | 0 | 0 |
| 47.50 and under 57.50 | 52.50 | 18 | +1 | +18 |
| 57.50 and under 67.50 | 62.50 | 11 | +2 | +22 |
| 67.50 and under 77.50 | 72.50 | 10 | +3 | +30 |
| 77.50 and under 87.50 | 82.50 | 9 | +4 | +36 |
| 87.50 and under 97.50 | 92.50 | 3 | +5 | +15 |
| Total | | 155 | | −25 |

M' = 42.50; i = 10.00

$$M = 42.50 + (-25 \div 155)\,10 = 42.50 + (-.161)\,10 = 42.50 - 1.61 = \$40.89$$

Frequency Distributions with Unequal Classes or Open Ends. On some occasions, it is necessary to compute an arithmetic average from a frequency distribution in which the class intervals are unequal in width or which contains open intervals at either end. An open-end frequency distribution is one in which the lower limiting value or the upper limiting value, or both, is not indicated. Although open ends should be avoided whenever possible, if it is felt that such a class is necessary the total value of all the items in the class, or their average value, or the value of each individual item should be given in a footnote to aid in the description and analysis of the distribution. Unless information is supplied in one of these three forms, it becomes impossible satisfactorily to calculate the arithmetic average of the distribution, because the data required in the computation are not all available.

Table 78 illustrates an open-end frequency distribution that also contains unequal class intervals. The computation of the arithmetic average of this kind of distribution is shown in the table. The unequal class intervals in this case may be due to the administrative requirements of the association.

The arithmetic-average purchase can be calculated directly by dividing the total sales by the number of purchases, just as was done in the distribution with equal class intervals. Care must be exercised, however, to use the correct midpoints of the classes. Short-cut method 1 may save time in calculation, and, as shown in columns 4 and 5 of Table 78, can be applied in this type of distribution just as easily as in one with equal class intervals. The step method is not commonly used in distributions with unequal classes. If it were employed in this calculation, columns 1, 2 and 4 would be unchanged. From column 4 it would be apparent that the width of the step should be $5. The step deviations would read −11, −7, 0, +11, +32, and +61. The computation would be

$$M = \$65.00 + (-655 \div 724)(\$5) = \$60.48$$
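Both short-cut methods can be sketched as a single routine, since method 1 is simply method 2 with a step of one unit. The data are transcribed from Tables 76, 77, and 78; the function names and the `step` parameter are illustrative, not notation from the text.

```python
def net_deviation(midpoints, frequencies, assumed, step=1.0):
    """Net total of frequency times deviation from the assumed average, in steps."""
    return sum(f * (x - assumed) / step for x, f in zip(midpoints, frequencies))


def short_cut_mean(midpoints, frequencies, assumed, step=1.0):
    """Assumed average plus the prorated net deviation, converted back to original units."""
    net = net_deviation(midpoints, frequencies, assumed, step)
    return assumed + net / sum(frequencies) * step


# Rent distribution (Tables 76 and 77).
rent_x = [12.50, 22.50, 32.50, 42.50, 52.50, 62.50, 72.50, 82.50, 92.50]
rent_f = [16, 27, 44, 17, 18, 11, 10, 9, 3]

net_a = net_deviation(rent_x, rent_f, assumed=42.50)               # method 1, computation A
net_b = net_deviation(rent_x, rent_f, assumed=22.50)               # method 1, computation B
net_steps = net_deviation(rent_x, rent_f, assumed=42.50, step=10)  # method 2

rent_mean_a = short_cut_mean(rent_x, rent_f, assumed=42.50)
rent_mean_b = short_cut_mean(rent_x, rent_f, assumed=22.50)
rent_mean_steps = short_cut_mean(rent_x, rent_f, assumed=42.50, step=10)

# Purchases distribution (Table 78): unequal classes, open-end class
# represented by its known average value of 370.00.
purch_x = [10.00, 30.00, 65.00, 120.00, 225.00, 370.00]
purch_f = [248, 140, 202, 74, 49, 11]

purch_net_steps = net_deviation(purch_x, purch_f, assumed=65.00, step=5)
purch_mean = short_cut_mean(purch_x, purch_f, assumed=65.00, step=5)

print(net_a, net_b, net_steps, purch_net_steps)
print(f"{rent_mean_a:.2f} {rent_mean_b:.2f} {rent_mean_steps:.2f} {purch_mean:.2f}")
```

Whatever assumed average or step width is chosen, the result agrees with the direct method; only the size of the intermediate products changes.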
TABLE 78
METHOD OF CALCULATING THE AVERAGE VALUE OF PURCHASES OF ACTIVE PATRONS OF A CO-OPERATIVE ASSOCIATION, JANUARY 1 TO DECEMBER 31, 1937 *

Columns 4 and 5 show short-cut method 1.

| Value of purchases (dollars) | (1) Number of purchasers (frequency) f | (2) Midpoint X | (3) Frequency × midpoint fX | (4) Deviation from assumed average, $65.00, d | (5) Frequency × deviation fd |
|---|---|---|---|---|---|
| 0.00 and under 20.00 | 248 | 10.00 | 2,480 | −55 | −13,640 |
| 20.00 and under 40.00 | 140 | 30.00 | 4,200 | −35 | −4,900 |
| 40.00 and under 90.00 | 202 | 65.00 | 13,130 | 0 | 0 |
| 90.00 and under 150.00 | 74 | 120.00 | 8,880 | +55 | +4,070 |
| 150.00 and under 300.00 | 49 | 225.00 | 11,025 | +160 | +7,840 |
| 300.00 and over | 11 | 370.00 † | 4,070 | +305 | +3,355 |
| Total | 724 | | 43,785 | | −3,275 |

$$M = \frac{43{,}785}{724} = \$60.48 \qquad M = 65.00 + (-3{,}275 \div 724) = 65.00 - 4.52 = \$60.48$$

* From unpublished business reports of the association.
† Average value obtained by dividing total sales in the interval by the number of purchases.

The problem of the open-end distribution can be solved as indicated whenever the total of the open-end class is known, as it is in this case, or when the average value for the class is provided. Too frequently in published data neither of these values is known, so that, without employing dangerous assumptions, it becomes impossible to calculate the arithmetic average. Distributions of this type are common in many kinds of census data. In such instances, it is necessary to employ one of the other measures of central tendency that depend upon position and do not make use of extreme values. These measures are described in chapter XVII.

THE GEOMETRIC AVERAGE

Definition

The geometric average is a measure of central tendency which, like the arithmetic average, depends upon the values of all the items in the group. It is formally defined as the positive value of the nth root of the product of n positive items. This definition may sound very forbidding, but the method of computation is relatively simple.
In symbols, it becomes:

$$\text{Geometric mean} = \sqrt[n]{X_1 \times X_2 \times X_3 \times \cdots \times X_n}$$

Following the definition it is only necessary to extract the nth root of the product of the n items included. This work is greatly facilitated by the use of logarithms, the geometric average being simply the anti-logarithm of the arithmetic mean of the logarithms of all the items.

The Average of Ungrouped Data

The use of logarithms in computing the geometric average is illustrated in Table 79. The logarithms come directly from the table in Appendix C. The sum of the logarithms divided by the number of items gives the logarithm of the geometric average, and the anti-logarithm is the geometric average. As shown at the bottom of the table the geometric average of these five numbers is 59.3, whereas the arithmetic average is 177.8. The arithmetic average is three times as great as the geometric average. This difference is the result of the greater importance of large values in the arithmetic average, which exceeds four of the five items. The geometric, on the other hand, has three items below it and two above. The example is intended to bring out the more representative character of the geometric as an average of values that are scattered as much as those in the table.

TABLE 79
COMPUTATION OF THE GEOMETRIC AVERAGE OF FIVE NUMBERS

| Numbers | Logs of numbers |
|---|---|
| 6 | 0.77815 |
| 22 | 1.34242 |
| 50 | 1.69897 |
| 175 | 2.24304 |
| 636 | 2.80346 |
| Total: 889 | Total: 8.86604 |
| 889 ÷ 5: M = 177.8 | 8.86604 ÷ 5: log G.M. = 1.77321 |
| | G.M. = 59.32 |

The question of representativeness is related to the properties of the two averages. The arithmetic average is so located that the sum of the deviations of the individual values from it will be zero. One value that greatly exceeds the others provides a large deviation that offsets many small ones near the point of concentration of the data. Thus whenever a few exceptionally large values are included in the set, the arithmetic average will exceed the values of a majority of the individual items.
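The logarithmic computation of Table 79 can be sketched as follows. The five values come from the table; the sketch uses Python's `math.log10` in place of the printed table of logarithms, so the intermediate figures may differ from the text in the last decimal place.

```python
import math

# The five numbers of Table 79.
values = [6, 22, 50, 175, 636]

# Arithmetic average: sum of the items divided by their number.
arithmetic = sum(values) / len(values)

# Geometric average: anti-logarithm of the arithmetic mean of the logarithms.
mean_log = sum(math.log10(v) for v in values) / len(values)
geometric = 10 ** mean_log

# How many items fall below each average.
below_arithmetic = sum(1 for v in values if v < arithmetic)
below_geometric = sum(1 for v in values if v < geometric)

print(f"arithmetic average:   {arithmetic:.1f}")
print(f"log of geometric avg: {mean_log:.5f}")
print(f"geometric average:    {geometric:.2f}")
print(f"items below M: {below_arithmetic}, items below G.M.: {below_geometric}")
```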
The corresponding property of the geometric average is: the product of the ratios of the individual items to the average equals unity. In this computation an item one tenth as large as the average offsets an item ten times as large. For example, in Table 79 for the first item, 6, the ratio to the average is .101, and for the last item, 636, the ratio is 10.7. Therefore the two offset each other almost exactly in the computation of the average.

In general, the geometric average should be used when a few large items destroy the representativeness of the arithmetic average. This situation is particularly likely to arise in averaging ratios when most of them fall close to the lower limit of the available range, and a few have much higher values. The geometric average can frequently be employed to advantage in measuring average rates of change from one time period to another.

The Average of a Frequency Distribution

The geometric average can be computed from a frequency distribution by a method very similar to that employed in calculating the arithmetic average. It is necessary to remember the basic assumptions of the grouping of data in a frequency distribution: that all items in each class interval are evenly distributed throughout the interval, and that for each interval a single value must be selected which is representative of all the values in the interval or which is equivalent to the average of the values of the items in the interval. The midpoint of each interval, which is equal to the arithmetic average of the class limits, was assumed to be the arithmetic mean of the values included in the interval. To be consistent in calculating the geometric average of a frequency distribution, the geometric average of the class limits should be used for this purpose, but the additional work involved in carrying out these calculations is not justified by the improvement in the results.
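The size of the refinement that is being set aside can be judged from a short Python sketch. It compares the ordinary midpoint with the geometric average of the class limits for three five-point intervals of the kind used in Table 80; the choice of these three intervals, and the variable names, are assumptions made only for illustration.

```python
import math

# Three class intervals of price relatives (per cents), chosen for illustration.
class_limits = [(94.5, 99.5), (100.5, 105.5), (150.5, 155.5)]

differences = []
for lower, upper in class_limits:
    midpoint = (lower + upper) / 2           # arithmetic average of the limits
    geo_of_limits = math.sqrt(lower * upper)  # geometric average of the limits
    differences.append(midpoint - geo_of_limits)
    print(f"{lower}-{upper}: midpoint {midpoint:.2f}, "
          f"geometric {geo_of_limits:.2f}, "
          f"difference {midpoint - geo_of_limits:.3f}")
```

For intervals this narrow relative to their level, the two values differ by only a few hundredths of a point, which is consistent with using the ordinary class marks. The gap would be larger for wide intervals that begin near zero.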
As a consequence the class marks are assumed to be the geometric averages of the items in the several classes.

A formula for finding the geometric average of a frequency distribution can be developed from the corresponding arithmetic average formula by substituting logarithms of values for direct use of values, i.e.,

\[ \log G.M. = \frac{\sum f \log X}{\sum f} \]

in which X stands for the class marks and f for the corresponding frequencies. The anti-logarithm of the result obtained by performing the operations indicated on the right side of the equation is the geometric average.

The steps in the process are illustrated in Table 80 by the computation of the average of the price relatives of 771 of the commodities included in the Bureau of Labor Statistics Index of Wholesale Prices. The relatives express the change produced in the United States price system by the outbreak of war in Europe. The average increase is 6.0 per cent. The significance of this change in a period of 30 days might be overlooked until one reflects that the prices in the table represent transactions approximating twenty billions of dollars monthly. Hence an increase of 6.0 per cent in price would add 1.2 billions of dollars to the exchange value of goods. The arithmetic average increase in the price relatives is 6.5 per cent. The difference between the two averages is far from negligible in terms of the increase in exchange value of goods.

In a distribution of ratios such as this, the arithmetic average places undue emphasis on the ratios above the peak of the distribution. This characteristic will be referred to in the discussion of index numbers as the inherent upward bias of the arithmetic average. A biased result can be avoided by the use of the geometric average. When distributions are approximately normal, however, this advantage of the geometric average tends to disappear because in such cases there will be very little difference between the two averages.
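The computation illustrated in Table 80 can be retraced with the following Python sketch, offered only as a check under stated assumptions. Several frequencies are hard to read in this copy of the table, so the values below are the ones implied by dividing each f log X entry by log X. The open-end classes are entered item by item from the table footnotes, and the same treatment is assumed for the arithmetic average. All variable names are illustrative.

```python
import math

# (class mark X, frequency f) for the closed classes of Table 80.
# Frequencies are those implied by the table's f log X column (an assumption
# where the printed frequency is illegible).
closed_classes = [
    (97, 16), (100, 378), (103, 117), (108, 75), (113, 60), (118, 37),
    (123, 23), (128, 22), (133, 15), (138, 8), (143, 8), (148, 2), (153, 4),
]
# Individual relatives in the two open-end classes, from the table footnotes.
open_end_items = [94, 92, 88, 61, 161, 162]

n = sum(f for _, f in closed_classes) + len(open_end_items)

# Sum of f * log X, with the open-end items contributing their own logarithms.
sum_f_log_x = (sum(f * math.log10(x) for x, f in closed_classes)
               + sum(math.log10(x) for x in open_end_items))
# Sum of f * X for the arithmetic average, treated the same way.
sum_f_x = sum(f * x for x, f in closed_classes) + sum(open_end_items)

geometric = 10 ** (sum_f_log_x / n)
arithmetic = sum_f_x / n

print(f"number of commodities = {n}")
print(f"log G.M. = {sum_f_log_x / n:.5f}")
print(f"geometric average  = {geometric:.1f} per cent")
print(f"arithmetic average = {arithmetic:.1f} per cent")
```

Under these assumptions the sketch gives the 6.0 per cent and 6.5 per cent average increases quoted in the text.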
TABLE 80
CALCULATION OF GEOMETRIC AVERAGE FROM FREQUENCY DISTRIBUTION OF RELATIVES. WHOLESALE PRICE CHANGES IN 771 COMMODITIES, FROM AUGUST TO SEPTEMBER, 1939*

PRICE RELATIVES (per cents)    NO. OF         CLASS MARKS
SEPT. 1939 ÷ AUG. 1939         COMMODITIES    (per cents)
                               f              X              LOG X        f LOG X
Less than 94.5                   4†                                          7.66673§
 94.5- 99.5                     16              97            1.98677       31.78832
 99.5-100.5                    378             100            2.00000      756.00000
100.5-105.5                    117             103            2.01284      235.50228
105.5-110.5                     75             108            2.03342      152.50650
110.5-115.5                     60             113            2.05308      123.18480
115.5-120.5                     37             118            2.07188       76.65956
120.5-125.5                     23             123            2.08991       48.06793
125.5-130.5                     22             128            2.10721       46.35862
130.5-135.5                     15             133            2.12385       31.85775
135.5-140.5                      8             138            2.13988       17.11904
140.5-145.5                      8             143            2.15534       17.24272
145.5-150.5                      2             148            2.17026        4.34052
150.5-155.5                      4             153            2.18469        8.73876
155.5 and over                   2‡                                          4.41635§
Total                          771                                        1561.44988

\[ \log G.M. = \frac{1561.44988}{771} = 2.02523 \qquad G.M. = 106.0 \text{ per cent} \]

* Prepared from Monthly Release, Average Wholesale Prices and Index Numbers of Individual Commodities, United States Bureau of Labor Statistics, August and September, 1939.
† 94, 92, 88, 61.
‡ 161, 162.
§ Sum of the logarithms of the individual relatives in this class.

Characteristics of the Geometric Average

Every value in a group of data must be included in the calculation of the geometric average and hence the value of this measure of central tendency cannot be influenced by individual judgment factors. In this respect, the geometric average and the arithmetic average are similar. They differ chiefly because in the geometric average small values have a greater effect than large ones, whereas the reverse is true in the arithmetic average. Consequently the arithmetic average is always larger than the geometric average, except when all the values are equal, in which case the two averages coincide.

The geometric average can be employed only when all the values are positive, and none is zero. In view of this restriction, a geometric average percentage of profits and losses, for instance, cannot be calculated, for such data include plus and minus values.
And a geometric average of percentage decreases and increases can be calculated only by expressing them as percentage relatives.

The geometric average is not so easy to understand as the arithmetic average. Although the principle of the geometric average appears clear enough, its meaning is not easy to comprehend; computation and practice in interpretation are essential before it becomes a readily available tool in statistical analysis. The length of the computation and the lack of easily understood properties have been important factors in restricting its use as a measure of central tendency.

PROBLEMS

1. Explain the difference between an unweighted and a weighted arithmetic average.

2. Six averages are computed from a set of values for variables, X, and a set of weights, W, as follows:

           X     W    X×W         X     W    X×W         X     W    X×W
           7    10     70        24    10    240         5     1      5
          24     1     24        12     2     24        15    10    150
           5    12     60         7     4     28        24    12    288
          15     2     30         5    12     60         7     2     14
          12     4     48        15     1     15        12     4     48
Total     63    29    232        63    29    367        63    29    505
Average  12.6          8.0      12.6         12.7      12.6         17.4

   What is the relation of these averages to the discussion on pages 391-92?

3. One year ago an investor owned the following stocks and received the annual dividend returns as stated:

   STOCK     INVESTMENT     DIVIDEND RETURN
   A          $ 5,000           $300
   B           12,000            480
   C            2,000            160
   Total      $19,000           $940
   Average return 4.95 per cent

   Today his stock holdings are as follows:

   STOCK     INVESTMENT     DIVIDEND RETURN
   A          $ 8,000           $480
   B            6,000            240
   C            5,000            400
   Total      $19,000         $1,120
   Average return 5.89 per cent

   a) How are the average rates of return obtained?
   b) Inasmuch as none of the individual dividend rates has changed during the year, how do you explain the increase in the average return?

4. Compute the weighted average percentage of change in retail sales in the following lines of trade in Ohio in September, 1939, compared with August, 1939 (data from Table 74, page 394).
   Grocery with meats
   Department stores
   Motor vehicles
   Gasoline filling stations
   Restaurants
   Drugs

5. In accordance with the list on page 397 of the text compute the number of food units required for the following family:

                    AGE
   Husband           32
   Wife              28
   Grandmother       60
   Son                5
   Daughter           1

6. On a single graph, plot the following two distributions of earnings in Hosiery Mill XYZ:

   WEEKLY EARNINGS               SEMI-SKILLED EMPLOYEES
   (dollars)                     Male        Female
    6.00 and under 10.00            0          21
   10.00 and under 14.00           33          45
   14.00 and under 18.00           91          56
   18.00 and under 22.00          122          28
   22.00 and under 26.00           74          12
   26.00 and under 30.00           24           6
   30.00 and under 34.00            4
   34.00 and under 38.00            2
   Total                          350         168

7. a) Using data in Problem 6, find the average weekly earnings of (1) semi-skilled males, or of (2) semi-skilled females, using three different methods of computation. Indicate all computations.
   b) From the shape of the distribution in Problem 6 and the value of the average in (1) or (2), whichever was assigned to you, state the characteristics of the distribution of earnings of either male or female hosiery workers.

8. a) (1) What is the arithmetic average income of the total families
      (3) Does the difference between the two averages indicate what the average annual cost of owning an automobile ought to be? Discuss.
   b) (1) Compute the percentage of families owning automobiles in each income group.
      (2) Compute the average of the percentages in (1).
      (3) 36,500 is what per cent of 68,200?
      (4) Does the result of either (2) or (3) give the percentage of families in this distribution that own automobiles? Explain.

   CAR OWNERSHIP BY U. S. FAMILIES HAVING INCOMES LESS THAN $5,000 BY INCOME GROUPS, DATA FROM A SURVEY IN 1933 IN 18 CITIES, BY THE U. S. DEPARTMENT OF COMMERCE

   INCOME GROUP       TOTAL NO. OF           NO. OF FAMILIES
   (dollars)          FAMILIES REPORTING     OWNING CARS
       0-  499            19,400                 5,800
     500-  999            15,800                 7,200
   1,000-1,499            13,700                 8,600
   1,500-1,999             9,300                 6,700
   2,000-2,999             7,000                 5,600
   3,000-4,999             3,000                 2,600
   Total                  68,200                36,500

9. The number of stores operated by each of eight retail variety chains in 1936 was:

   CHAINS                        NO. OF STORES, NOV. 1936*
   W. T. Grant & Co.                   477
   H. L. Green Co., Inc.               134
   S. S. Kresge Co.                    731
   S. H. Kress & Co.                   235
   McCrory Stores Corp.                194
   G. C. Murphy Co.                    194
   J. C. Penney Co.                  1,496
   F. W. Woolworth Co.               1,995

   * Survey of Current Business, January, 1937.

   a) Compute the arithmetic average number of stores per chain.
   b) Compute the geometric average number of stores per chain.
   c) Explain why the geometric average is superior for these data.

10. a) Using the data in columns (A), (B), (C), (D), and (E) of Table 55, page 294, compute an unweighted geometric average index for each year.
    b) Compare your results with the weighted arithmetic average index appearing at the lower right of the table.
    c) Discuss the merits of each index as a measure of the credit standing of a prospective borrower.

11. a) From the following table, compute the arithmetic average and the geometric average:

    PER CENT DISTRIBUTION OF INDUSTRIAL ESTABLISHMENTS IN THE UNITED STATES, ACCORDING TO VALUE OF PRODUCTS, 1925

    VALUE OF PRODUCTS            PERCENTAGE OF TOTAL
    (dollars)                    ESTABLISHMENTS
        5,000-   20,000               30
       20,000-  100,000               37
      100,000-  500,000               22
      500,000-1,000,000                5
    1,000,000-2,000,000                6
    Total                            100

    b) Which average is more representative for the data? Give reasons.

CHAPTER XVII
MEASURES OF CENTRAL TENDENCY
AVERAGES OF POSITION

THE preceding chapter was devoted to the discussion of those measures of central tendency that depend upon calculation processes. This chapter proceeds with the description of measures that are determined by their positions in a given set of data and hence require the exercise of the computer's judgment. There are two commonly used measures of this type, (1) the median and (2) the mode.
THE MEDIAN

Whether in an array or a frequency distribution, the median is the value of the middle item. Expressed more precisely, it is the value at the point on either side of which there is an equal number of items. That is, when the number of items is uneven the median has the value of the middle item; when the number is even it lies half way between the two items at the center.

Location and Value of the Median in an Array

For a given set of data the location of the value of the median will be at the same item or between the same two items, whether the value of each item is known separately or whether they are grouped in a frequency distribution. Slightly different procedures are needed in the two cases, however, to fix the location of the median item and to determine its value. A simple diagram will explain the reason for this difference. Suppose that we wish to find the middle point of a distance of 5 miles. Obviously it is located at 2.5 miles, computed by taking \( \frac{5}{2} \), or \( \frac{N}{2} \) if N = the number of miles. But if we have 5 men, one located at the center of each mile, the middle man is not the 2.5th but the 3d. From Figure 59, it is clear that each man's number is .5 more than the mileage at the point where he is standing. This is because the men are located and numbered at the center of each space, whereas each milestone is at the end of the space it measures. Thus the 3d man stands at the 2.5th mile and the two medians coincide although they are designated differently.

[FIGURE 59. LOCATION OF THE MEDIAN IN AN ARRAY. Five men, numbered 1 to 5, each standing at the center of one of 5 miles; the median man (the 3d) stands at the median distance (2.5 miles).]

Application of Formula. When individual items are arrayed they correspond to the 5 men in Figure 59. Therefore in order to find the middle item it is necessary to add .5 of an item to the fraction \( \frac{N}{2} \). This is accomplished by adding 1 to the numerator. Thus the formula for locating the median item in an array is \( \frac{N+1}{2} \).
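The difference between the middle of a distance, N/2, and the middle item, (N + 1)/2, can be sketched in Python as follows. The function names are hypothetical, the sample values are borrowed from Table 79 purely for convenience, and the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

def median_position(n):
    """Position of the median item in an array of n items: (n + 1) / 2."""
    return (n + 1) / 2

def median_of_array(values):
    """Value of the middle item, or the average of the two central items."""
    ordered = sorted(values)
    n = len(ordered)
    pos = median_position(n)
    if n % 2 == 1:
        return ordered[int(pos) - 1]   # e.g. the 3d of 5 items
    lower = ordered[int(pos) - 1]      # item just below the half-way position
    upper = ordered[int(pos)]          # item just above it
    return (lower + upper) / 2

five_items = [6, 22, 50, 175, 636]
four_items = [6, 22, 50, 175]

print(median_position(5), median_of_array(five_items))   # 3.0 and the 3d item
print(median_position(4), median_of_array(four_items))   # 2.5 and the average of the 2d and 3d
print(statistics.median(five_items), statistics.median(four_items))
```

With an even number of items the position comes out as a half value, which is the signal to average the two items on either side of it.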
The value of the median is then simply that of the middle item or the average of the two central items.

The usefulness of the formula in finding the middle one of a large number of items is illustrated in Table 81. The formula for the median item is:

\[ \frac{N+1}{2} = \frac{65+1}{2} = 33 \]

Counting from either end of the array, the value of the 33d wage, $53.35, is the median. If there had been 64 pay checks in the group (omitting the last one, $80.12) the solution by the formula would show the median to be the 32.5th item. It would then be a value half way between the 32d and 33d items and would be equal to one-half the sum (the arithmetic average) of the values of the 32d and 33d wages, i.e., ($52.93 + $53.35) ÷ 2 = $53.14 = the median. This value ($53.14) is not the value of any single wage in the array, but is the value half way between the two central items and on each side of which there is an equal number of items.

TABLE 81
ARRAY OF WEEKLY WAGES OF 65 SKILLED RUBBER WORKERS IN A FACTORY IN MICHIGAN*

$36.96      $51.92      $53.35 (median)      $56.43
 38.84       51.93       53.43                58.34
 38.96       51.96       53.58                58.58
 41.12       52.48       53.66                59.13
 47.02       52.62       53.73                60.01
 47.99       52.73       53.93                60.12
 49.07       52.73       54.32                61.36
 49.29       52.73       54.74                62.69
 49.35       52.73       55.52                63.45
 49.43       52.73       56.31                68.62
 49.43       52.83       56.43                68.62
 49.43       52.83       56.43                71.34
 50.03       52.83       56.43                73.42
 51.16       52.83       56.43                73.49
 51.90       52.92       56.43                73.49
 51.92       52.93       56.43                78.82
                                              80.12

* Confidential unpublished source.

The Extended Median. It is obvious that a measure that is determined by the value of a single item or the average of two items has some unreliability, especially if there is only a small number of items in a sample. Suppose that an array consists of the five items, 5, 7, 15, 17, and 18; the median is 15. If two lower values, 2 and 3, are added, the median is 7; it has been reduced by 8 units, whereas if two higher values, 20 and 24, had been added instead, it would have been
increased by only 2 units, to 17. In a small group therefore, unless the values are closely concentrated, the central item may be shifted so much by the addition of a very few items at either end that the median is too erratic to be depended upon. In such a case it is possible to resort to a more stable measure, the extended median. This is obtained by averaging the 3, 4, or 5 central items instead of taking a single item or the average of only two of them.

Using the same example that was suggested above, the extended median of 5, 7, 15, 17, and 18 is \( \frac{7 + 15 + 17}{3} = 13 \). Adding the two lower values, the extended median of 2, 3, 5, 7, 15, 17, and 18 is \( \frac{5 + 7 + 15}{3} = 9 \); adding the two higher values, the extended median of 5, 7, 15, 17, 18, 20, and 24 is \( \frac{15 + 17 + 18}{3} = 16\tfrac{2}{3} \). Thus the extended median fluctuated 4 points on one side of the center and 3⅔ points on the other side when two more items were added at the ends of the series, whereas the ordinary median fluctuated 8 points on one side and 2 on the other. As the number of items averaged at the center is increased, the fluctuations will become smaller and more even on either side.

This measure is particularly valuable in determining the seasonal pattern of a time series, a purpose for which the median value of each month over a period of years is very commonly used. Speaking of the use of the extended median for obtaining seasonal variations, Christopher Saunders, of the University of Manchester, has said: "The extended median is probably the best average to take, since it is not influenced by extreme values which may be due to exceptional circumstances, and at the same time is not, like the simple median, affected by the accident of the value in any single year."¹

Location and Value of the Median in a Frequency Distribution

In order to explain why the determination of the median in a frequency distribution differs from that in an array, it is necessary to go back to Figure 59.
The 5 miles correspond to the range of unit values of class intervals in a frequency distribution. But instead of having one item in the center of each mile, or interval, there may be any number of items. In the absence of detailed information to the contrary, these are assumed to be evenly distributed throughout the interval, each one being at the center of its own "item range." Figure 60 illustrates a very simple frequency distribution made up of 10 weekly wages selected from Table 81 and grouped in three class intervals.

[FIGURE 60. LOCATION OF THE MEDIAN IN A FREQUENCY DISTRIBUTION. Rows of the diagram: (A) 10 wage items or frequencies, 2, 5, and 3; (B) assumed distribution of items, numbered 1 to 10; (C) measure of item ranges; (D) measure of unit values, $40 to $80, with the median at $56; (E) 3 class intervals, $40 to $50, $50 to $60, and $60 to $80. Inset table: wages $40-50, 2 employees; $50-60, 5 employees; $60-80, 3 employees; total, 10.]

If we knew the value of each of these 10 items, we could determine the location of the median from the formula for an array, \( \frac{N+1}{2} = \frac{10+1}{2} = 5.5 \), and its value would be half way between the 5th and 6th items (row B). But, since we know nothing except the class values and the number of items in each class, some other method must be found for determining an approximate value for the median.

¹ Christopher Saunders, "Seasonal Variations in Employment," The Economic Journal, Vol. XLV, No. 178 (June 1935), p. 272.

Application of Formula. The logical procedure is to interpolate from the assumed distribution of items in the class in which the median item falls to the unit values of that class. The median is still the value midway between the 5th and 6th items but, in order to interpolate on the diagram, the values of the items as well as of the class intervals must be measured along a scale.
Hence we use the "milestones" (row C) that mark the ends of the "item ranges"² instead of the numbers (row B) that count the items. The center of the 10 item ranges is therefore \( \frac{N}{2} = \frac{10}{2} = 5 \), and it will be seen from Figure 60 that this point in row C coincides with the point midway between the 5th and 6th items in row B. A line drawn from the end of the 5th item range intersects the scale of unit values (row D) at $56, which is therefore the value of the median wage for this group of 10 items.

By computation, the class in which the median item will fall is determined by cumulating the frequencies until they exceed the value of \( \frac{N}{2} \). In this case \( \frac{N}{2} = 5 \). This exceeds 2, the number of frequencies in the lowest class interval, but is less than 2 + 5, the sum of the frequencies in the first two classes; hence the median falls in the 2d class, $50 to $60. The distance of the median value from the lower limit of the median class interval bears the same proportion to the width of that interval as the number of its item range within the interval bears to the total number of item ranges in the interval. To dete